I am working on some data processing in a Rails app and I am trying to deal with a performance pain point. I have two arrays, x_data and y_data, that each look as follows (with different values, of course):
[
{ 'timestamp_value' => '2017-01-01 12:00', 'value' => '432' },
{ 'timestamp_value' => '2017-01-01 12:01', 'value' => '421' },
...
]
Each array has up to perhaps 25k items. I need to prepare this data for further x-y regression analysis.
Now, some values in x_data or y_data can be nil. I need to remove values from both arrays if either x_data or y_data has a nil value at that timestamp. I then need to return only the values for both arrays.
In my current approach, I am first extracting the timestamps from both arrays where the values are not nil, then performing a set intersection on the timestamps to produce a final timestamps array. I then select values using that final array of timestamps. Here's the code:
def values_for_regression(x_data, y_data)
x_timestamps = timestamps_for(x_data)
y_timestamps = timestamps_for(y_data)
# Get final timestamps as the intersection of the two
timestamps = x_timestamps.intersection(y_timestamps)
x_values = values_for(x_data, timestamps)
y_values = values_for(y_data, timestamps)
[x_values, y_values]
end
def timestamps_for(data)
Set.new data.reject { |row| row['value'].nil? }.
map { |row| row['timestamp_value'] }
end
def values_for(data, timestamps)
data.select { |row| timestamps.include?(row['timestamp_value']) }.
map { |row| row['value'] }
end
This approach isn't terribly performant, and I need to do this on several sets of data in quick succession. The overhead of the multiple loops adds up. There must be a way to at least reduce the number of loops necessary.
Any ideas or suggestions will be appreciated.
You're doing a lot of redundant iterating and creating a lot of intermediate arrays of data.
Your timestamps_for and values_for both filter (with reject or select) and then map. The filter creates an intermediate array; since your arrays are up to 25,000 items, this is potentially an intermediate throw-away array of the same size. You're doing this four times, once each for the x and y timestamps, and once each for the x and y values. You produce another intermediate collection by taking the intersection of the two sets of timestamps. You also do a complete scan of both arrays twice, once to find timestamps with non-nil values, and again to map the timestamps you just extracted to their values.
While it's definitely more readable to functionally transform the input arrays, you can dramatically reduce memory usage and execution time by combining the various iterations and transformations.
All the iterations can be combined into a single loop over one data set (along with setup time for producing a timestamp->value lookup hash for the second set). Any timestamp not present in the first set causes the matching timestamp in the second set to be ignored anyway, so there is no reason to find all the timestamps in both sets, only to then find their intersection.
def values_for_regression(x_data, y_data)
  x_values = []
  y_values = []
  # Build a timestamp => value lookup for the second data set
  y_map = y_data.each_with_object({}) { |data, hash| hash[data['timestamp_value']] = data['value'] }
  x_data.each do |data|
    # Skip timestamps where either value is nil or missing
    next unless x_value = data['value']
    next unless y_value = y_map[data['timestamp_value']]
    x_values << x_value
    y_values << y_value
  end
  [x_values, y_values]
end
I think this is functionally identical, and a quick benchmark shows a ~70% reduction in runtime:
user system total real
yours 9.640000 0.150000 9.790000 ( 9.858914)
mine 2.780000 0.060000 2.840000 ( 2.845621)
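A small sanity check can make the "functionally identical" claim easier to trust. The sketch below restates the single-pass approach under a hypothetical name, paired_values, assuming the 'timestamp_value' key from the question and unique timestamps within each array. The sample rows are made up for illustration.

```ruby
# Hypothetical restatement of the single-pass approach, for illustration only.
# Assumes rows are hashes keyed by 'timestamp_value' and 'value'.
def paired_values(x_data, y_data)
  # Build a timestamp => value lookup for the second data set.
  y_map = y_data.each_with_object({}) do |row, hash|
    hash[row['timestamp_value']] = row['value']
  end

  x_values = []
  y_values = []
  x_data.each do |row|
    x_value = row['value']
    y_value = y_map[row['timestamp_value']]
    # Skip timestamps where either side is nil or missing.
    next if x_value.nil? || y_value.nil?

    x_values << x_value
    y_values << y_value
  end
  [x_values, y_values]
end

x_data = [
  { 'timestamp_value' => '2017-01-01 12:00', 'value' => '432' },
  { 'timestamp_value' => '2017-01-01 12:01', 'value' => nil },
  { 'timestamp_value' => '2017-01-01 12:02', 'value' => '425' }
]
y_data = [
  { 'timestamp_value' => '2017-01-01 12:00', 'value' => '12' },
  { 'timestamp_value' => '2017-01-01 12:01', 'value' => '14' },
  { 'timestamp_value' => '2017-01-01 12:03', 'value' => '15' }
]

x_values, y_values = paired_values(x_data, y_data)
p x_values # => ["432"]
p y_values # => ["12"]
```

Only 12:00 survives: 12:01 has a nil x value, 12:02 has no y row, and 12:03 has no x row.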
Related
Given three arrays of unique ids, where the goal is to create individual identifiers that join each member of the three arrays
array_a = [1,2]
array_b = [43,44,47]
array_c = [3,15]
this implies 2 * 3 * 2 individual identifiers (separated by underscores for legibility purposes):
1_43_3, 1_43_15, 1_44_3, 1_44_15, 1_47_3, 1_47_15, 2_43_3, 2_43_15, 2_44_3, 2_44_15, 2_47_3, 2_47_15
Is there a Ruby method that creates such a set, i.e. multiplies arrays of arrays?
Use the product method.
Input
array_a = [1,2]
array_b = [43,44,47]
array_c = [3,15]
Program
p array_a.product(array_b,array_c).map{|x|x.join("_")}
Output
["1_43_3", "1_43_15", "1_44_3", "1_44_15", "1_47_3", "1_47_15", "2_43_3", "2_43_15", "2_44_3", "2_44_15", "2_47_3", "2_47_15"]
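If the number of arrays is not fixed at three, the same call can be made with a splat. This is a minimal sketch; the helper name joined_identifiers is hypothetical and it assumes at least one array is passed.

```ruby
# Hypothetical helper: joins one element from each array with underscores.
# `arrays` may contain any number of arrays (at least one).
def joined_identifiers(arrays)
  first, *rest = arrays
  first.product(*rest).map { |ids| ids.join('_') }
end

ids = joined_identifiers([[1, 2], [43, 44, 47], [3, 15]])
puts ids.size  # 12
puts ids.first # 1_43_3
```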
Not to my knowledge, but it's fairly trivial to implement with a couple of loops:
array_a = [1,2]
array_b = [43,44,47]
array_c = [3,15]
combined = array_a.flat_map do |a|
array_b.flat_map do |b|
array_c.map do |c|
[a, b, c].join("_")
end
end
end
Edit - although the solution using product from @Rajagopalan is very neat.
This is not an answer, just shedding light on the two valid answers provided.
Running a performance test in the following manner:
time = Benchmark.measure {
code_to_test
}
puts time
with four data sets:
the first an array with sizes 10x10x10,
the second an array with sizes 20x20x20, which is almost an order of magnitude greater than the former,
then a third array with sizes 30x30x30,
and a final 40x40x40, almost another order of magnitude.
The product method returns for each array
user system total
0.002692 0.000340 0.003032
0.057010 0.003608 0.060618
0.078614 0.010978 0.089592
0.217555 0.015326 0.232881
while the nested flat_map array returns
0.002562 0.000145 0.002707
0.077731 0.001857 0.079588
0.085422 0.001829 0.087251
0.263692 0.005506 0.269198
rather indistinguishable, even at relatively high numbers.
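For anyone who wants to repeat the comparison, here is a rough sketch of such a test. The array contents and the helper names (sample_arrays, with_product, with_flat_map) are assumptions made for illustration, and the timings will differ by machine and Ruby version.

```ruby
require 'benchmark'

# Build three arrays of the given size; the contents are arbitrary.
def sample_arrays(size)
  [(1..size).to_a, (101..(100 + size)).to_a, (201..(200 + size)).to_a]
end

def with_product(a, b, c)
  a.product(b, c).map { |ids| ids.join('_') }
end

def with_flat_map(a, b, c)
  a.flat_map do |x|
    b.flat_map do |y|
      c.map { |z| [x, y, z].join('_') }
    end
  end
end

[10, 20, 30, 40].each do |size|
  a, b, c = sample_arrays(size)
  product_time = Benchmark.measure { with_product(a, b, c) }
  flat_map_time = Benchmark.measure { with_flat_map(a, b, c) }
  puts "#{size}x#{size}x#{size}"
  puts "  product:  #{product_time.total.round(6)}"
  puts "  flat_map: #{flat_map_time.total.round(6)}"
end
```

A single run per size is noisy, so repeating each measurement several times would give more reliable figures.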
Given a table of N integers, how can I check whether any element is repeated, and show a message that the table has repeating elements if so? I would like to achieve this with minimum time complexity.
A hash table is the way to go (i.e. a normal Lua table). Just loop over each integer and place it into the table as the key, but first check if the key already exists. If it does then you have a repeat value. So something like:
values = { 1, 2, 3, 4, 5, 1 } -- input values
local htab = {}
for _, v in ipairs(values) do
if htab[v] then print('duplicate value: ' .. v)
else htab[v] = true end
end
With small integer values the table will use an array so will be O(1) to access. With larger and therefore sparser values the values will be in the hash table part of the table which can just be assumed to be O(1) as well. And since you have N values to insert this is O(N).
Getting faster than O(N) should not be possible since you have to visit each value in the list at least once.
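For comparison with the Ruby threads on this page, the same hash-based idea can be sketched in Ruby with a Set. The method name repeating? is hypothetical, and unlike the Lua loop above it stops at the first duplicate rather than printing every one.

```ruby
require 'set'

# Returns true as soon as a value is seen twice; O(N) expected time.
def repeating?(values)
  seen = Set.new
  # Set#add? returns nil when the value is already present.
  values.any? { |v| seen.add?(v).nil? }
end

puts 'table has repeating elements' if repeating?([1, 2, 3, 4, 5, 1])
```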
I’m using Rails 4.2.7. I have two arrays, (arr1 and arr2) that both contain my model objects. Is there a way to do an intersection on both arrays if an object from arr1 has a field, “myfield1,” (which is a number) that matches an object in arr2? Both arrays will have unique sets of objects. Currently I have
arr1.each_with_index do |my_object, index|
arr2.each_with_index do |my_object2, index|
if my_object.myfield1 == my_object2.myfield1
results.push(my_object)
end
end
end
but this strikes me as somewhat inefficient. I figure there’s a simpler way to get the results I need but am not versed enough in Ruby to know how to do it.
You can build an intersection of the values to find the common values, then select records that have the common values.
field_in_both = arr1.map(&:myfield1) & arr2.map(&:myfield1)
intersection = arr1.select{|obj| field_in_both.include? obj.myfield1} +
arr2.select{|obj| field_in_both.include? obj.myfield1}
I notice that in your code you're only storing records from arr1. If that's the correct behaviour, then you can simplify my answer:
field_in_both = arr1.map(&:myfield1) & arr2.map(&:myfield1)
intersection = arr1.select{|obj| field_in_both.include? obj.myfield1}
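For larger arrays, Array#include? scans the whole list on every call, so the select above is still quadratic in the worst case. A possible variation, sketched here with a Struct standing in for the model objects (Record and matching_on_myfield1 are hypothetical names), is to put the field values from arr2 into a Set first:

```ruby
require 'set'

# Stand-in for a model object; only the field used for matching is modelled.
Record = Struct.new(:id, :myfield1)

# Returns the objects from arr1 whose myfield1 also occurs in arr2.
def matching_on_myfield1(arr1, arr2)
  wanted = arr2.map(&:myfield1).to_set
  arr1.select { |obj| wanted.include?(obj.myfield1) }
end

arr1 = [Record.new(1, 10), Record.new(2, 20), Record.new(3, 30)]
arr2 = [Record.new(4, 20), Record.new(5, 30), Record.new(6, 40)]

p matching_on_myfield1(arr1, arr2).map(&:id) # => [2, 3]
```

Set lookups are expected constant time, so the whole operation is roughly linear in the combined size of the two arrays.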
I have written a method to calculate a given percentile for a set of numbers for use in an application I am building. Typically the user needs to know the 25th percentile of a given set of numbers and the 75th percentile.
My method is as follows:
def calculate_percentile(array,percentile)
#get number of items in array
return nil if array.empty?
#sort the array
array.sort!
#get the array length
arr_length = array.length
#multiply items in the array by the required percentile (e.g. 0.75 for 75th percentile)
#round the result up to the next whole number
#then subtract one to get the array item we need to return
arr_item = ((array.length * percentile).ceil)-1
#return the matching number from the array
return array[arr_item]
end
This looks to provide the results I was expecting but can anybody refactor this or offer an improved method to return specific percentiles for a set of numbers?
Some remarks:
If a particular index of an Array does not exist, [] will return nil, so your initial check for an empty Array is unnecessary.
You should not sort! the Array argument, because you are affecting the order of the items in the Array in the code that called your method. Use sort (without !) instead.
You don't actually use arr_length after assignment.
A return statement on the last line is unnecessary in Ruby.
There is no standard definition for the percentile function (there can be a lot of subtleties with rounding), so I'll just assume that how you implemented it is how you want it to behave. Therefore I can't really comment on the logic.
That said, the function that you wrote can be written much more tersely while still being readable.
def calculate_percentile(array, percentile)
array.sort[(percentile * array.length).ceil - 1]
end
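A quick usage sketch of the terse version, with made-up sample data, shows the behaviour described in the remarks: the caller's array keeps its order and an empty array yields nil.

```ruby
def calculate_percentile(array, percentile)
  array.sort[(percentile * array.length).ceil - 1]
end

data = [50, 15, 40, 20, 35]

p calculate_percentile(data, 0.25) # => 20
p calculate_percentile(data, 0.75) # => 40
p calculate_percentile([], 0.75)   # => nil
p data                             # => [50, 15, 40, 20, 35] (caller's order untouched)
```

One edge case to be aware of: a percentile of 0 produces index -1, which Ruby treats as the last element, so it returns the largest value rather than the smallest.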
Here's the same refactored into a one liner. You don't need an explicit return as the last line in Ruby. The return value of the last statement of the method is what's returned.
def calculate_percentile(array=[],percentile=0.0)
# multiply items in the array by the required percentile
# (e.g. 0.75 for 75th percentile)
# round the result up to the next whole number
# then subtract one to get the array item we need to return
array ? array.sort[((array.length * percentile).ceil)-1] : nil
end
Not sure if it's worth it, but here is how I did it for the quartiles:
def median(list)
(list[(list.size - 1) / 2] + list[list.size / 2]) / 2
end
numbers = [1, 2, 3, 4, 5, 6]
if numbers.size % 2 == 0
puts median(numbers[0...(numbers.size / 2)])
puts median(numbers)
puts median(numbers[(numbers.size / 2)..-1])
else
median_index = numbers.index(median(numbers))
puts median(numbers[0..(median_index - 1)])
puts median(numbers)
puts median(numbers[(median_index + 1)..-1])
end
If you're calculating both quartiles, you might want to move the "sort" outside the function, so that it only needs to be done once. This also means you aren't modifying your caller's data (sort!), nor making a copy every time the function is called (sort).
I know, premature optimisation and all that. And it's a bit awkward for the function to say, "the array must be sorted before calling this function". So it's reasonable to leave it as it is.
But sorting already-sorted data is going to take considerably longer than the whole rest of the function put together(*). It also has higher algorithmic complexity: O(N) at best, when the function could be O(1) for the second quartile (although O(N log N) for the first one if the data is not already sorted, of course). So it's worth avoiding if performance might ever be an issue for this function.
There are slightly faster ways of finding the two quartiles than a full sort (look up "selection algorithms"). For instance if you're familiar with the way qsort uses pivots, observe that if you need to know the 25th and 75th items out of 100, and your pivot at some stage ends up in position 80, then there's absolutely no point recursing into the block above the pivot. You really don't care what order those elements are in, just that they're in the top quartile. But this will considerably increase the complexity of the code compared with just calling a library to sort for you. Unless you really need a minor performance boost, I think you're good as you are.
(*) Unless ruby arrays have a flag to remember they're already sorted and haven't been modified since. I don't know whether they do, but if so then using sort! a second time is of course free.
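The "sort once" idea could look something like the following. This is only a sketch: percentile_of_sorted is a hypothetical name, it keeps the rounding rule from the question, and it relies on the caller passing sorted data.

```ruby
# Hypothetical variant that expects already-sorted input and does no sorting itself.
def percentile_of_sorted(sorted, percentile)
  sorted[(percentile * sorted.length).ceil - 1]
end

data = [50, 15, 40, 20, 35]
sorted = data.sort # one O(N log N) sort, shared by both lookups

lower = percentile_of_sorted(sorted, 0.25)
upper = percentile_of_sorted(sorted, 0.75)
p [lower, upper] # => [20, 40]
```

Each lookup after the sort is a single index operation, and data itself is left unmodified because sort returns a new array.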
I am collecting the values for a specific column from a named_scope as follows:
a = survey_job.survey_responses.collect(&:base_pay)
This gives me a numeric array for example (1,2,3,4,5). I can then pass this array into various functions I have created to retrieve the mean, median, standard deviation of the number set. This all works fine however I now need to start combining multiple columns of data to carry out the same types of calculation.
I need to collect the details of perhaps three fields as follows:
survey_job.survey_responses.collect(&:base_pay)
survey_job.survey_responses.collect(&:bonus_pay)
survey_job.survey_responses.collect(&:overtime_pay)
This will give me 3 arrays. I then need to combine these into a single array by adding each of the matching values together - i.e. add the first result from each array, the second result from each array and so on so I have an array of the totals.
How do I create a method which will collect all of this data together and how do I call it from the view template?
Really appreciate any help on this one...
Thanks
Simon
s = survey_job.survey_responses
pay = s.collect(&:base_pay).zip(s.collect(&:bonus_pay), s.collect(&:overtime_pay))
pay.map{|i| i.compact.inject(&:+) }
Do that, but with meaningful variable names and I think it will work.
Define a normal method in app/helpers/_helper.rb and it will work in the view
Edit: now it works if they contain nil or are of different sizes (as long as the longest array is the one on which zip is called).
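To see why the longest array has to be the receiver, here is a small sketch with plain arrays standing in for the collected columns (the numbers are made up):

```ruby
base_pay     = [100, 200, 300]
bonus_pay    = [10, nil]
overtime_pay = [5]

# zip pads the shorter arguments with nil up to the receiver's length.
rows = base_pay.zip(bonus_pay, overtime_pay)
p rows # => [[100, 10, 5], [200, nil, nil], [300, nil, nil]]

# compact drops the nils before summing each row.
totals = rows.map { |row| row.compact.inject(:+) }
p totals # => [115, 200, 300]
```

If the receiver were shorter than the arguments, zip would stop at the receiver's length and the extra values would be dropped. A row made up entirely of nils would also come out as nil, because inject on an empty array returns nil.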
Here's a method that will combine an arbitrary number of arrays by taking the sum at each index. It'll allow each array to be of different length, too.
def combine(*arrays)
# Get the length of the largest array, that'll be the number of iterations needed
maxlen = arrays.map(&:length).max
out = []
maxlen.times do |i|
# Push the sum of all array elements at a given index to the result array
out.push(arrays.map { |a| a[i] }.inject(0) { |memo, value| memo + value.to_i })
end
out
end
Then, in the controller, you could do
base_pay = survey_job.survey_responses.collect(&:base_pay)
bonus_pay = survey_job.survey_responses.collect(&:bonus_pay)
overtime_pay = survey_job.survey_responses.collect(&:overtime_pay)
@total_pay = combine(base_pay, bonus_pay, overtime_pay)
And then refer to @total_pay as needed in your view.