Array Merge (Union) - ruby-on-rails

I have two arrays I need to merge, and using the union (|) operator is PAINFULLY slow. Are there any other ways to accomplish an array merge?
Also, the arrays are filled with objects, not strings.
An example of the objects within the array:
#<Article
id: 1,
xml_document_id: 1,
source: "<article><domain>events.waikato.ac</domain><excerpt...",
created_at: "2010-02-11 01:32:46",
updated_at: "2010-02-11 01:41:28"
>
Where source is a short piece of XML.
EDIT
Sorry! By 'merge' I mean I need to not insert duplicates.
A => [1, 2, 3, 4, 5]
B => [3, 4, 5, 6, 7]
A.magic_merge(B) #=> [1, 2, 3, 4, 5, 6, 7]
Understand that the integers are actually Article objects, and that the union operator appears to take forever.

Here's a script which benchmarks two merge techniques: using the pipe operator (a1 | a2), and using concatenate-and-uniq ((a1 + a2).uniq). Two additional benchmarks give the time of concatenate and uniq individually.
require 'benchmark'
a1 = []; a2 = []
[a1, a2].each do |a|
  1000000.times { a << rand(999999) }
end
puts "Merge with pipe:"
puts Benchmark.measure { a1 | a2 }
puts "Merge with concat and uniq:"
puts Benchmark.measure { (a1 + a2).uniq }
puts "Concat only:"
puts Benchmark.measure { a1 + a2 }
puts "Uniq only:"
b = a1 + a2
puts Benchmark.measure { b.uniq }
On my machine (Ubuntu Karmic, Ruby 1.8.7), I get output like this:
Merge with pipe:
1.000000 0.030000 1.030000 ( 1.020562)
Merge with concat and uniq:
1.070000 0.000000 1.070000 ( 1.071448)
Concat only:
0.010000 0.000000 0.010000 ( 0.005888)
Uniq only:
0.980000 0.000000 0.980000 ( 0.981700)
Which shows that these two techniques are very similar in speed, and that uniq is the larger component of the operation. This makes sense intuitively: uniq has to inspect every element to weed out duplicates (at least O(n)), whereas concatenation is just a straightforward bulk copy.
So, if you really want to speed this up, look at how hash and eql? are implemented for the objects in your arrays, since those are what Array#uniq and Array#| use to detect duplicates. I believe that most of the time is being spent comparing objects to ensure inequality between any pair in the final array.
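For an ActiveRecord model like the one in the question, a minimal sketch of that idea, assuming two Article records with the same id can be treated as the same article (articles_a and articles_b are hypothetical variables):
class Article < ActiveRecord::Base
  # Assumption: two Article records with the same id represent the same article.
  def eql?(other)
    other.is_a?(Article) && other.id == id
  end
  # Array#uniq and Array#| bucket elements by #hash before calling #eql?,
  # so hashing on the integer id keeps the per-element work cheap.
  def hash
    id.hash
  end
end
merged = articles_a | articles_b  # now compares ids rather than whole objects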

Do you need the items to be in a specific order within the arrays? If not, you may want to check whether using Sets makes it faster.
Update
Adding to another answerer's code:
require "set"
require "benchmark"
a1 = []; a2 = []
[a1, a2].each do |a|
  1000000.times { a << rand(999999) }
end
s1, s2 = Set.new, Set.new
[s1, s2].each do |s|
  1000000.times { s << rand(999999) }
end
puts "Merge with pipe:"
puts Benchmark.measure { a1 | a2 }
puts "Merge with concat and uniq:"
puts Benchmark.measure { (a1 + a2).uniq }
puts "Concat only:"
puts Benchmark.measure { a1 + a2 }
puts "Uniq only:"
b = a1 + a2
puts Benchmark.measure { b.uniq }
puts "Using sets"
puts Benchmark.measure {s1 + s2}
puts "Starting with arrays, but using sets"
puts Benchmark.measure {s3, s4 = [a1, a2].map{|a| Set.new(a)} ; (s3 + s4)}
gives (for ruby 1.8.7 (2008-08-11 patchlevel 72) [universal-darwin10.0])
Merge with pipe:
1.320000 0.040000 1.360000 ( 1.349563)
Merge with concat and uniq:
1.480000 0.030000 1.510000 ( 1.512295)
Concat only:
0.010000 0.000000 0.010000 ( 0.019812)
Uniq only:
1.460000 0.020000 1.480000 ( 1.486857)
Using sets
0.310000 0.010000 0.320000 ( 0.321982)
Starting with arrays, but using sets
2.340000 0.050000 2.390000 ( 2.384066)
This suggests that sets may or may not be faster, depending on your circumstances (for example, whether you do many merges, so the one-off cost of converting from arrays is amortised, or only a single merge).
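If you control both collections end to end, one hedged sketch is to keep the data as Sets the whole time, so the conversion cost is paid only once (set_a and set_b are hypothetical names):
require 'set'
set_a = Set.new   # build these up as Sets from the start instead of arrays
set_b = Set.new
# ... populate set_a and set_b ...
merged = set_a | set_b   # returns a new Set; duplicates are handled via #hash / #eql?
set_a.merge(set_b)       # or merge in place, avoiding the allocation of a new Set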

Using the Array#concat method will likely be a lot faster, according to my initial benchmarks using Ruby 1.8.7:
require 'benchmark'
def reset_arrays!
  @array1 = []
  @array2 = []
  [@array1, @array2].each do |array|
    10000.times { array << ActiveSupport::SecureRandom.hex }
  end
end
reset_arrays! && puts(Benchmark.measure { @array1 | @array2 })
# => 0.030000 0.000000 0.030000 ( 0.026677)
reset_arrays! && puts(Benchmark.measure { @array1.concat(@array2) })
# => 0.000000 0.000000 0.000000 ( 0.000122)
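One caveat worth noting: Array#concat appends in place and does not remove duplicates, so by itself it doesn't satisfy the "no duplicates" requirement from the question's edit. A minimal illustration:
a = [1, 2, 3]
b = [3, 4]
a.concat(b)  #=> [1, 2, 3, 3, 4]   (a is mutated and the duplicate 3 remains)
a.uniq!      #=> [1, 2, 3, 4]      (the deduplication step is still where the time goes)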

Try this and see if it is any faster:
a = [1,2,3,3,2]
b = [1,2,3,4,3,2,5,7]
(a+b).uniq

Related

How to do a single-line cumulative count for hash values in Ruby?

I've got the following data set:
{
Nov 2020=>1,
Dec 2020=>2,
Jan 2021=>3,
Feb 2021=>4,
Mar 2021=>5,
Apr 2021=>6
}
Using the following code:
cumulative_count = 0
count_data = {}
data_set.each { |k, v| count_data[k] = (cumulative_count += v) }
I'm producing the following set of data:
{
Nov 2020=>1,
Dec 2020=>3,
Jan 2021=>6,
Feb 2021=>10,
Mar 2021=>15,
Apr 2021=>21
}
Even though I've got the each as a single line, I feel like there's got to be some way to do the entire thing as a one-liner. I've tried using inject with no luck.
This would do the trick:
input.each_with_object([]) { |(key, value), arr| arr << [key, arr.empty? ? value : value + arr.last[1]] }.to_h
=> {"Nov 2020"=>1, "Dec 2020"=>3, "Jan 2021"=>6, "Feb 2021"=>10, "Mar 2021"=>15, "Apr 2021"=>21}
for input defined as:
input = {
'Nov 2020' => 1,
'Dec 2020' => 2,
'Jan 2021' => 3,
'Feb 2021' => 4,
'Mar 2021' => 5,
'Apr 2021' => 6
}
The idea is to build up an array (via each_with_object) holding the processed data, which makes it easy to read the value of the previous pair and therefore accumulate the running total. At the end, we transform this array into a hash so that we have the data structure we want.
Just to add a disclaimer: if you don't want to rely on the input Hash already being in chronological order, a full one-liner that sorts by date first would be the following:
input.to_a.sort_by { |pair| Date.parse(pair[0]) }.each_with_object([]) { |pair, arr| arr << [pair[0], arr.empty? ? pair[1] : pair[1] + arr.last[1]] }.to_h
=> {"Nov 2020"=>1, "Dec 2020"=>3, "Jan 2021"=>6, "Feb 2021"=>10, "Mar 2021"=>15, "Apr 2021"=>21}
In this case we apply the same idea, but first convert the original data into an array ordered by date.
input = {
'Nov 2020' => 1,
'Dec 2020' => 2,
'Jan 2021' => 3,
'Feb 2021' => 4,
'Mar 2021' => 5,
'Apr 2021' => 6
}
If it must be on one physical line, and semicolons are allowed:
t = 0; input.each_with_object({}) { |(k, v), a| t += v; a[k] = t }
If it must be on one physical line, and semicolons are not allowed:
input.each_with_object({ t: 0, data: {}}) { |(k, v), a| (a[:t] += v) and (a[:data][k] = a[:t]) }[:data]
But in real practice, I think it's easier to read on multiple physical lines :)
t = 0
input.each_with_object({}) { |(k, v), a|
  t += v
  a[k] = t
}
TL;DR
This is what I ultimately ended up going with:
input.each_with_object({}) { |(k, v), h| h[k] = v + h.values.last.to_i }
Hats off to Marcos Parreiras (the accepted answer) for pointing me in the direction of each_with_object and the idea to pull the last value for accumulation instead of using += on a cumulative variable initialized as 0.
Details
I ended up with 3 potential solutions (listed below): my original code plus two options utilizing each_with_object, one of which depends on an array and the other on a hash.
Original
cumulative_count = 0
count_data = {}
input.each { |k, v| count_data[k] = (cumulative_count += v) }
Using array
input.each_with_object([]) { |(k, v), a| a << [k, v + a.last&.last.to_i] }.to_h
Using hash
input.each_with_object({}) { |(k, v), h| h[k] = v + h.values.last.to_i }
I settled on the option using the hash because I think it's the cleanest. However, it's worth noting that it's not the most performant. Based purely on performance, the original solution is hands-down the winner. Naturally, they're all extremely fast, so in order to really see the performance difference I had to run the options a very high number of times (displayed below). But since my actual solution will only be run once at a time in Production, I decided to go for succinctness over nanoseconds of performance. :-)
Performance
Each solution was run inside of 2_000_000.times { }.
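The exact harness isn't shown above, but a sketch of the kind of measurement implied (Benchmark.measure wrapped around 2,000,000 iterations of the hash-based option) would be:
require 'benchmark'
# Hypothetical harness reconstructing the measurement described above.
result = Benchmark.measure do
  2_000_000.times { input.each_with_object({}) { |(k, v), h| h[k] = v + h.values.last.to_i } }
end
p result  # prints a Benchmark::Tms like the ones below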
Original
#<Benchmark::Tms:0x00007fde00fb72d8 @real=2.5452079999959096, @stime=0.09558999999999962, @total=2.5108440000000005, @utime=2.415254000000001>
Using array
#<Benchmark::Tms:0x00007fde0a1f58e8 @real=7.3623509999597445, @stime=0.08986500000000053, @total=7.250730000000002, @utime=7.160865000000001>
Using hash
#<Benchmark::Tms:0x00007f9e19ca7678 @real=5.903417999972589, @stime=0.057482000000000255, @total=5.830285999999999, @utime=5.772803999999999>
input = {
'Nov 2020' => 1,
'Dec 2020' => 2,
'Jan 2021' => 3,
'Feb 2021' => 4,
'Mar 2021' => 5,
'Apr 2021' => 6
}
If, as in the example, the values begin at 1 and each after the first is 1 greater than the previous value (recall key/value insertion order is guaranteed in hashes), the value n is to be converted to 1 + 2 +...+ n, which, being the sum of an arithmetic series, equals the following.
input.transform_values { |v| (1+v)*v/2 }
#=> {"Nov 2020"=>1, "Dec 2020"=>3, "Jan 2021"=>6, "Feb 2021"=>10,
# "Mar 2021"=>15, "Apr 2021"=>21}
Note that this does not require Hash#transform_values to process key-value pairs in any particular order.

Ruby: Increasing Efficiency

I am dealing with a large quantity of data and I'm worried about the efficiency of my operations at scale. After benchmarking, the average time to execute this piece of code is about 0.004 sec. The goal of this line of code is to find the difference between the two values in each array location.
In a previous operation, 111.111 was loaded into the arrays in locations which contained invalid data. Due to some weird time-domain issues I couldn't just remove those values, so I needed some distinguishable placeholder; I could probably use nil here instead. Anyway, back to the explanation: this line of code checks that neither array has the 111.111 placeholder in the current location. If the values are valid then I perform the mathematical operation; otherwise I want to delete the values (or at least exclude them from the new array I'm writing to). I accomplished this by placing a nil in that location and then compacting the array afterwards.
The time of 0.004 sec for 4000 data points in each array isn't terrible, but this line of code is executed 25M times. I'm hoping someone might be able to offer some insight into how I might optimize it.
temp_row = row_1.zip(row_2).map do |x, y|
  x == 111.111 || y == 111.111 ? nil : (x - y).abs
end.compact
You are wasting CPU generating nils in the ternary statement and then using compact to remove them. Instead, use reject or select to drop the pairs containing 111.111, then map over what remains.
Instead of:
row_1 = [1, 111.111, 2]
row_2 = [2, 111.111, 4]
temp_row = row_1.zip(row_2).map do |x, y|
  x == 111.111 || y == 111.111 ? nil : (x - y).abs
end.compact
temp_row # => [1, 2]
I'd start with:
temp_row = row_1.zip(row_2)
.reject{ |x,y| x == 111.111 || y == 111.111 }
.map{ |x,y| (x - y).abs }
temp_row # => [1, 2]
Or:
temp_row = row_1.zip(row_2)
           .each_with_object([]) { |(x, y), ary|
             ary << (x - y).abs unless (x == 111.111 || y == 111.111)
           }
temp_row # => [1, 2]
Benchmarking arrays of different sizes shows some good things to know:
require 'benchmark'
DECIMAL_SHIFT = 100
DATA_ARRAY = (1 .. 1000).to_a
ROW_1 = (DATA_ARRAY + [111.111]).shuffle
ROW_2 = (DATA_ARRAY.map{ |i| i * 2 } + [111.111]).shuffle
Benchmark.bm(16) do |b|
  b.report('ternary:') do
    DECIMAL_SHIFT.times do
      ROW_1.zip(ROW_2).map do |x, y|
        x == 111.111 || y == 111.111 ? nil : (x - y).abs
      end.compact
    end
  end
  b.report('reject:') do
    DECIMAL_SHIFT.times do
      ROW_1.zip(ROW_2).reject{ |x, y| x == 111.111 || y == 111.111 }.map{ |x, y| (x - y).abs }
    end
  end
  b.report('each_with_index:') do
    DECIMAL_SHIFT.times do
      ROW_1.zip(ROW_2)
           .each_with_object([]) { |(x, y), ary|
             ary += [(x - y).abs] unless (x == 111.111 || y == 111.111)
           }
    end
  end
end
# >> user system total real
# >> ternary: 0.240000 0.000000 0.240000 ( 0.244476)
# >> reject: 0.060000 0.000000 0.060000 ( 0.058842)
# >> each_with_index: 0.350000 0.000000 0.350000 ( 0.349363)
Adjust the size of DECIMAL_SHIFT and DATA_ARRAY and the placement of 111.111 to get an idea of which expressions work best for your data size and structure, and fine-tune the code as necessary.
You can also try the parallel gem (https://github.com/grosser/parallel) and spread the work over multiple threads or processes.
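A hedged sketch of that idea using the parallel gem's Parallel.map (row_pairs is a hypothetical array of [row_1, row_2] pairs). Note that on MRI, plain threads won't speed up CPU-bound math because of the GVL, so in_processes is usually the better fit for this kind of work:
require 'parallel'
results = Parallel.map(row_pairs, in_processes: 4) do |row_1, row_2|
  row_1.zip(row_2)
       .reject { |x, y| x == 111.111 || y == 111.111 }
       .map { |x, y| (x - y).abs }
end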

Creating a range from one column

I have a column called "Marks" which contains values like
Marks = [100,200,150,157,....]
I need to assign Grades to those marks using the following key
<25=0, <75=1, <125=2, <250=3, <500=4, >500=5
If marks < 25, then grade = 0; if marks < 75, then grade = 1; and so on.
I can sort the results and find the first record that matches using Ruby's find method. Is that the best approach? Or could I prepare a range by adding Lower Limit and Upper Limit columns to the table and populating those ranges from the key? Marks can have decimals too, e.g. 99.99.
Without using Rails, you could do it like this:
marks = [100, 200, 150, 157, 692, 12]
marks_to_grade = { 25=>0, 75=>1, 125=>2, 250=>3, 500=>4, Float::INFINITY=>5 }
Hash[marks.map { |m| [m, marks_to_grade.find { |k,_| m <= k }.last] }]
#=> {100=>2, 200=>3, 150=>3, 157=>3, 692=>5, 12=>0}
With Ruby 2.1, you could write this:
marks.map { |m| [m, marks_to_grade.find { |k,_| m <= k }.last] }.to_h
Here's what's happening:
Enumerable#map (a.k.a collect) converts each mark m to an array [m, g], where g is the grade computed for that mark. For example, when map passes the first element of marks into its block, we have:
m = 100
a = marks_to_grade.find { |k,_| m <= k }
#=> marks_to_grade.find { |k,_| 100 <= k }
#=> [125, 2]
a.last
#=> 2
so the mark 100 is mapped to [100, 2]. (I've replaced the block variable for the value of the key-value pair with the placeholder _ to draw attention to the fact that the value is not being used in the calculation within the block. One could also use, say, _v as the placeholder.) The remaining marks are similarly mapped, resulting in:
b = marks.map { |m| [m, marks_to_grade.find { |k,_| m <= k }.last] }
#=> [[100, 2], [200, 3], [150, 3], [157, 3], [692, 5], [12, 0]]
Lastly
Hash[b]
#=> {100=>2, 200=>3, 150=>3, 157=>3, 692=>5, 12=>0}
or, for Ruby 2.1+
b.to_h
#=> {100=>2, 200=>3, 150=>3, 157=>3, 692=>5, 12=>0}
You can make use of update_all:
Student.where(:mark => 0...25).update_all(grade: 0)
Student.where(:mark => 25...75).update_all(grade: 1)
Student.where(:mark => 75...125).update_all(grade: 2)
Student.where(:mark => 125...250).update_all(grade: 3)
Student.where(:mark => 250...500).update_all(grade: 4)
Student.where("mark > ?", 500).update_all(grade: 5)

Rails mapping array of hashes onto single hash

I have an array of hashes like so:
[{"testPARAM1"=>"testVAL1"}, {"testPARAM2"=>"testVAL2"}]
And I'm trying to map this onto single hash like this:
{"testPARAM2"=>"testVAL2", "testPARAM1"=>"testVAL1"}
I have achieved it using
par={}
mitem["params"].each { |h| h.each {|k,v| par[k]=v} }
But I was wondering if it's possible to do this in a more idiomatic way (preferably without using a local variable).
How can I do this?
You could compose Enumerable#reduce and Hash#merge to accomplish what you want.
input = [{"testPARAM1"=>"testVAL1"}, {"testPARAM2"=>"testVAL2"}]
input.reduce({}, :merge)
is {"testPARAM2"=>"testVAL2", "testPARAM1"=>"testVAL1"}
Reducing an array is sort of like sticking a method call between each pair of its elements.
For example [1, 2, 3].reduce(0, :+) is like saying 0 + 1 + 2 + 3 and gives 6.
In our case we do something similar, but with the merge function, which merges two hashes.
[{:a => 1}, {:b => 2}, {:c => 3}].reduce({}, :merge)
is {}.merge({:a => 1}.merge({:b => 2}.merge({:c => 3})))
is {:a => 1, :b => 2, :c => 3}
How about:
h = [{"testPARAM1"=>"testVAL1"}, {"testPARAM2"=>"testVAL2"}]
r = h.inject(:merge)
Every answer so far advises using Enumerable#reduce (or inject, which is an alias) + Hash#merge, but beware: while clean, concise and readable, this solution is hugely time-consuming and has a large memory footprint on large arrays.
I have compiled different solutions and benchmarked them.
Some options
a = [{'a' => {'x' => 1}}, {'b' => {'x' => 2}}]
# to_h
a.to_h { |h| [h.keys.first, h.values.first] }
# each_with_object
a.each_with_object({}) { |x, h| h.store(x.keys.first, x.values.first) }
# each_with_object (nested)
a.each_with_object({}) { |x, h| x.each { |k, v| h.store(k, v) } }
# map.with_object
a.map.with_object({}) { |x, h| h.store(x.keys.first, x.values.first) }
# map.with_object (nested)
a.map.with_object({}) { |x, h| x.each { |k, v| h.store(k, v) } }
# reduce + merge
a.reduce(:merge) # takes way too much time on large arrays because Hash#merge creates a new hash on each iteration
# reduce + merge!
a.reduce(:merge!) # will modify a in an unexpected way
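About that reduce + merge! caveat: with no initial value, reduce uses the first element of the array as the accumulator, so merge! mutates that first hash in place. It is fast precisely because it reuses a single hash, but it changes your input:
a = [{ 'a' => 1 }, { 'b' => 2 }]
a.reduce(:merge!)  #=> { 'a' => 1, 'b' => 2 }
a.first            #=> { 'a' => 1, 'b' => 2 }  (the first element of a was modified)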
Benchmark script
It's important to use bmbm rather than bm, to avoid differences caused by the cost of memory allocation and garbage collection.
require 'benchmark'
a = (1..50_000).map { |x| { "a#{x}" => { 'x' => x } } }
Benchmark.bmbm do |x|
  x.report('to_h:') { a.to_h { |h| [h.keys.first, h.values.first] } }
  x.report('each_with_object:') { a.each_with_object({}) { |x, h| h.store(x.keys.first, x.values.first) } }
  x.report('each_with_object (nested):') { a.each_with_object({}) { |x, h| x.each { |k, v| h.store(k, v) } } }
  x.report('map.with_object:') { a.map.with_object({}) { |x, h| h.store(x.keys.first, x.values.first) } }
  x.report('map.with_object (nested):') { a.map.with_object({}) { |x, h| x.each { |k, v| h.store(k, v) } } }
  x.report('reduce + merge:') { a.reduce(:merge) }
  x.report('reduce + merge!:') { a.reduce(:merge!) }
end
Note: I initially tested with a 1_000_000-item array, but because the cost of reduce + merge grows so steeply, it would have taken far too long to finish.
Benchmark results
50k items array
Rehearsal --------------------------------------------------------------
to_h: 0.031464 0.004003 0.035467 ( 0.035644)
each_with_object: 0.018782 0.003025 0.021807 ( 0.021978)
each_with_object (nested): 0.018848 0.000000 0.018848 ( 0.018973)
map.with_object: 0.022634 0.000000 0.022634 ( 0.022777)
map.with_object (nested): 0.020958 0.000222 0.021180 ( 0.021325)
reduce + merge: 9.409533 0.222870 9.632403 ( 9.713789)
reduce + merge!: 0.008547 0.000000 0.008547 ( 0.008627)
----------------------------------------------------- total: 9.760886sec
user system total real
to_h: 0.019744 0.000000 0.019744 ( 0.019851)
each_with_object: 0.018324 0.000000 0.018324 ( 0.018395)
each_with_object (nested): 0.029053 0.000000 0.029053 ( 0.029251)
map.with_object: 0.021635 0.000000 0.021635 ( 0.021782)
map.with_object (nested): 0.028842 0.000005 0.028847 ( 0.029046)
reduce + merge: 17.331742 6.387505 23.719247 ( 23.925125)
reduce + merge!: 0.008255 0.000395 0.008650 ( 0.008681)
2M items array (excluding reduce + merge)
Rehearsal --------------------------------------------------------------
to_h: 2.036005 0.062571 2.098576 ( 2.116110)
each_with_object: 1.241308 0.023036 1.264344 ( 1.273338)
each_with_object (nested): 1.126841 0.039636 1.166477 ( 1.173382)
map.with_object: 2.208696 0.026286 2.234982 ( 2.252559)
map.with_object (nested): 1.238949 0.023128 1.262077 ( 1.270945)
reduce + merge!: 0.777382 0.013279 0.790661 ( 0.797180)
----------------------------------------------------- total: 8.817117sec
user system total real
to_h: 1.237030 0.000000 1.237030 ( 1.247476)
each_with_object: 1.361288 0.016369 1.377657 ( 1.388984)
each_with_object (nested): 1.765759 0.000000 1.765759 ( 1.776274)
map.with_object: 1.439949 0.029580 1.469529 ( 1.481832)
map.with_object (nested): 2.016688 0.019809 2.036497 ( 2.051029)
reduce + merge!: 0.788528 0.000000 0.788528 ( 0.794186)
Use #inject
hashes = [{"testPARAM1"=>"testVAL1"}, {"testPARAM2"=>"testVAL2"}]
merged = hashes.inject({}) { |aggregate, hash| aggregate.merge hash }
merged # => {"testPARAM1"=>"testVAL1", "testPARAM2"=>"testVAL2"}
Here you can use either inject or reduce from Enumerable; they are aliases of each other, so there is no performance difference between them.
sample = [{"testPARAM1"=>"testVAL1"}, {"testPARAM2"=>"testVAL2"}]
result1 = sample.reduce(:merge)
# {"testPARAM1"=>"testVAL1", "testPARAM2"=>"testVAL2"}
result2 = sample.inject(:merge)
# {"testPARAM1"=>"testVAL1", "testPARAM2"=>"testVAL2"}

Regular expression in ruby and matching so many results

Trying to create a simple regular expression that can extract numbers (between 7 and 14 digits long) that follow a keyword made up of the letter g and an id, something like the following:
(g)(\d{1,6})\s+(\d{7,14}\s*)+
Lets assume :
m = (/(g)(\d{1,6})\s+(\d{7,14}\s*)+/i.match("g12 327638474 83873478 2387327683 44 437643673476"))
I have results of:
#<MatchData "g12 327638474 83873478 2387327683 " "g" "12" "2387327683 ">
But what I need in the final result is to include 327638474, 83873478 and 2387327683, and exclude 44.
For now I'm only getting the last number, 2387327683, without the previous numbers.
Any help here?
Cheers
Instead of a regex, you can use something like this:
s = "g12 327638474 83873478 2387327683 44 437643673476"
s.split[1..-1].select { |x| (7..14).include?(x.size) }.map(&:to_i)
# => [327638474, 83873478, 2387327683, 437643673476]
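If you'd still like to stay with a regex, keep in mind that a repeated capture group only remembers its last match, which is why the original pattern returned just 2387327683. One hedged workaround is to validate the leading g<id> keyword first and then pull the 7-to-14-digit numbers out with String#scan (the \b word boundaries are an assumption about the input format):
s = "g12 327638474 83873478 2387327683 44 437643673476"
if s =~ /\Ag\d{1,6}\s/i                      # confirm the leading g<id> keyword
  numbers = s.scan(/\b\d{7,14}\b/).map(&:to_i)
end
numbers #=> [327638474, 83873478, 2387327683, 437643673476]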
Just as an FYI, here is a benchmark showing a slightly faster way of accomplishing what the selected answer does:
require 'ap'
require 'benchmark'
n = 100_000
s = "g12 327638474 83873478 2387327683 44 437643673476"
ap s.split[1..-1].select { |x| (7..14).include? x.size }.map(&:to_i)
ap s.split[1..-1].select { |x| 7 <= x.size && x.size <= 14 }.map(&:to_i)
Benchmark.bm(11) do |b|
  b.report('include?'   ) { n.times{ s.split[1..-1].select { |x| (7..14).include? x.size }.map(&:to_i) } }
  b.report('conditional') { n.times{ s.split[1..-1].select { |x| 7 <= x.size && x.size <= 14 }.map(&:to_i) } }
end
ruby ~/Desktop/test.rb
[
[0] 327638474,
[1] 83873478,
[2] 2387327683,
[3] 437643673476
]
[
[0] 327638474,
[1] 83873478,
[2] 2387327683,
[3] 437643673476
]
user system total real
include? 1.010000 0.000000 1.010000 ( 1.011725)
conditional 0.830000 0.000000 0.830000 ( 0.825746)
For speed I'll use the conditional test. It's a tiny bit more verbose, but is still easily read.
