I'm working on an AB test and need to divide a population. How should I divide something like
User.where(:condition => true) randomly into two roughly equal groups?
I'm considering iterating through the whole array and pushing onto one of two other arrays based on a random value, but this is a large query and that sounds very slow.
e.g.
first_group = []
second_group = []

array.each do |object|
  if rand(2) == 0
    first_group << object
  else
    second_group << object
  end
end
To get a random ordering right from the database you can do
# MySQL
User.order('RAND()')
# PostgreSQL
User.order('RANDOM()')
A nice one-liner to split an array into two halves:
left, right = a.each_slice( (a.size/2.0).round ).to_a
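Putting the two ideas together, a minimal sketch (assuming the User.where(:condition => true) scope from the question; RANDOM() is the PostgreSQL spelling, use RAND() on MySQL):
# Randomize in the database, then split the result in half
users = User.where(:condition => true).order('RANDOM()').to_a
first_group, second_group = users.each_slice((users.size / 2.0).round).to_a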
I would write a method that returns the two groups:
def randomizer(sample_size)
  initial_arr = ["obj1", "obj2", "objn"]
  sampler = initial_arr.sample(sample_size)
  remainder = initial_arr - sampler
  [sampler, remainder]
end
Here sample_size is the number of elements you want in the first group, like 50 or 100 depending on your data size.
As a basic trial I have done the same with
[:foo, :bar, :hello, :world, :ruby].sample(3)
The output might be [:hello, :ruby, :bar] (it varies per run).
The second group is the original array minus that same sample. Store the sample in a variable first; calling sample again, as in [:foo, :bar, :hello, :world, :ruby] - [:foo, :bar, :hello, :world, :ruby].sample(3), would subtract a different random sample. With the sample above, the second group is [:foo, :world].
This way you avoid looping over the array, and the code runs faster.
For more information see http://www.ruby-doc.org/core-2.1.1/Array.html#method-i-sample
Array#sample is available as of Ruby 1.9.1.
You could perform basic Array operations on the query result:
results = User.where(:condition => true)
Start by using shuffle to put the array into a random order:
array = results.shuffle
Then slice the array into two roughly equal parts:
group1 = array.slice(0...array.length / 2)
group2 = array.slice(array.length / 2..array.length)
If order is important, sort the groups back into the initial order:
group1.sort! {|a, b| results.index(a) <=> results.index(b) }
group2.sort! {|a, b| results.index(a) <=> results.index(b) }
Context:
Trying to generate an array with one element for each created_at day in a db table. Each element is the average of the points (integer) column from records with that created_at day.
This will later be graphed to display the avg number of points on each day.
Result:
I've been successful in doing this, but it feels like an unnecessary amount of code to generate the desired result.
Code:
def daily_avg
  # get all data for current user
  records = current_user.rounds

  # make array of long dates
  long_date_array = records.pluck(:created_at)

  # create array to store short dates
  short_date_array = []

  # remove time of day
  long_date_array.each do |date|
    short_date_array << date.strftime('%Y%m%d')
  end

  # remove duplicate dates
  short_date_array.uniq!

  # array of avg by date
  array_of_avg_values = []

  # iterate through each day
  short_date_array.each do |date|
    temp_array = []

    # make array of records with this day
    records.each do |record|
      if date === record.created_at.strftime('%Y%m%d')
        temp_array << record.audio_points
      end
    end

    # calc avg by day and append to array_of_avg_values
    array_of_avg_values << temp_array.inject(0.0) { |sum, el| sum + el } / temp_array.size
  end

  render json: array_of_avg_values
end
Question:
I think this is a common extraction problem needing to be solved by lots of applications, so I'm wondering if there's a known repeatable pattern for solving something like this?
Or a more optimal way to solve this?
(I'm barely a junior developer so any advice you can share would be appreciated!)
Yes, that's a lot of unnecessary stuff when you can just go down to SQL to do it (I'm assuming you have a class called Round in your app):
class Round
  DAILY_AVERAGE_SELECT = "SELECT
      DATE(rounds.created_at) AS day_date,
      AVG(rounds.audio_points) AS audio_points
    FROM rounds
    WHERE rounds.user_id = ?
    GROUP BY DATE(rounds.created_at)"

  def self.daily_average(user_id)
    connection.select_all(sanitize_sql_array([DAILY_AVERAGE_SELECT, user_id]), "daily-average")
  end
end
Doing this straight in the database will be faster (and involve less code) than doing it in Ruby as you're doing now.
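Calling it would then look something like this (hypothetical usage, mirroring the render call from the question):
render json: Round.daily_average(current_user.id)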
I advise you to do something like this:
grouped =
  records.order(:created_at).group_by do |r|
    r.created_at.strftime('%Y%m%d')
  end
First you let the database return the records in order, then you group them by the created_at field converted to just a date string.
points =
  grouped.map do |(date, values)|
    [date, values.sum(0.0, &:audio_points) / values.size]
  end.to_h
# => { "19700101" => 155.0, ... }
Then you map the grouped hash into [date, average] pairs, computing the mean of audio_points for each day.
You can also use the group and calculation methods built into ActiveRecord:
http://guides.rubyonrails.org/active_record_querying.html#group
http://guides.rubyonrails.org/active_record_querying.html#calculations
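For example, the whole method can collapse into a single query along these lines (a sketch assuming MySQL/SQLite's DATE() function; on PostgreSQL you would group by created_at::date instead):
# Let the database group rounds by day and average the points
current_user.rounds.group('DATE(created_at)').average(:audio_points)
# => e.g. { "2016-01-01" => 15.5, "2016-01-02" => 12.0, ... }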
I'm trying to implement my first Ruby sorting algorithm. This algorithm is based on some specific rules ("always prefer objects of type xxx over objects of type yyy"), and if none of these rules triggers, it falls back to the Ruby <=> operator. I'm doing this on a Ruby on Rails one-to-many association.
The problem is that this algorithm does not return the array itself; it just returns -1 or 1, the result of the comparison. I don't understand why, as my result is only returned in the sort block.
Here is my current code:
def sort_products!
  products.sort! do |p1, p2|
    result = 0

    # Scalable Products are always the last ones in order
    if p1.class.name == "ScalableProduct"
      result = -1
    elsif p2.class.name == "ScalableProduct"
      result = 1
    end

    if result == 0
      # Put products producing electricity and heating down
      if p1.can_deliver_electricity?
        result = -1
      elsif p2.can_deliver_electricity?
        result = 1
      end
    end

    # Else: just compare names
    result = p1.name <=> p2.name if result == 0
    result
  end
end
The best practice here, in my opinion, would be to implement <=> inside the Product model. You'll need to include the Comparable module in order to achieve this:
class Product
  include Comparable

  def <=>(another_product)
    # Compare self with another_product
    # Return -1, 0, or 1
  end
end
Then your sorting method will be reduced to:
def sort_products!
  products.sort!
end
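For completeness, here is one possible comparison body, mirroring the rules from the question's block (a sketch; the sort keys are my own formulation, and unlike the original it treats two ScalableProducts as equal so they fall through to the name comparison):
class Product
  include Comparable

  def <=>(other)
    sort_key(self) <=> sort_key(other)
  end

  private

  # Build comparison keys so Array#<=> applies the rules in order:
  # ScalableProducts first, then electricity producers, then names
  def sort_key(product)
    [
      product.class.name == "ScalableProduct" ? 0 : 1,
      product.can_deliver_electricity? ? 0 : 1,
      product.name
    ]
  end
end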
Change the do..end to braces as the block delimiters. Because of the precedence of the do..end syntax, it first sorts and then applies the block to the result; with braces, the block is used as the sort block, which is what you wanted.
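That is, the suggested change is just this (sketch; the full rule set from the question goes inside the braces):
def sort_products!
  products.sort! { |p1, p2|
    p1.name <=> p2.name # replace with the full comparison logic
  }
end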
Also, in your comparison, if both products are ScalableProduct then you will not order them in a sensible way. When both are ScalableProduct, you may want to leave result at 0 so it falls through to comparing by name. The same applies to can_deliver_electricity?.
Let's say I have a collection of users. Is there a way of using Mongoid to find n random users in the collection where it does not return the same user twice? For now let's say the user collection looks like this:
class User
  include Mongoid::Document
  field :name
end
Simple huh?
Thanks
If you just want one document, and don't want to define a new criteria method, you could just do this:
random_model = Model.skip(rand(Model.count)).first
If you want to find a random model based on some criteria:
criteria = Model.scoped_whatever.where(conditions) # query example
random_model = criteria.skip(rand(criteria.count)).first
The best solution is going to depend on the expected size of the collection.
For tiny collections, just fetch everything and use shuffle plus a slice.
For small sizes of n, you can get away with something like this:
result = (0..User.count - 1).sort_by { rand }.slice(0, n).collect! do |i|
  User.skip(i).first
end
For large sizes of n, I would recommend creating a "random" column to sort by. See here for details: http://cookbook.mongodb.org/patterns/random-attribute/ https://github.com/mongodb/cookbook/blob/master/content/patterns/random-attribute.txt
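That pattern boils down to storing a pre-computed random value on each document, indexing it, and querying against it. A hedged Mongoid sketch (the field name random_point is my choice, not from the cookbook):
class User
  include Mongoid::Document
  field :name
  field :random_point, type: Float, default: -> { rand }
  index({ random_point: 1 })
end

# Pick a random pivot and wrap around if nothing sorts above it
pivot = rand
User.where(:random_point.gte => pivot).first ||
  User.where(:random_point.lt => pivot).first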
MongoDB 3.2 comes to the rescue with $sample (link to doc)
EDIT: Recent versions of Mongoid implement $sample, so you can call YourCollection.all.sample(5).
Previous versions of Mongoid
Mongoid didn't support sample before Mongoid 6, so you have to run this aggregate query with the Mongo driver:
samples = User.collection.aggregate([ { '$sample': { size: 3 } } ])
# call samples.to_a if you want to get the objects in memory
What you can do with that
I believe this functionality will make its way into Mongoid soon, but in the meantime:
module Utility
  module_function

  def sample(model, count)
    ids = model.collection.aggregate([
      { '$sample': { size: count } }, # Sample from the collection
      { '$project': { _id: 1 } }      # Keep only the ID field
    ]).to_a.map(&:values).flatten     # Some Ruby magic

    model.find(ids)
  end
end
Utility.sample(User, 50)
If you really want simplicity you could use this instead:
class Mongoid::Criteria
  def random(n = 1)
    indexes = (0...count).sort_by { rand }.first(n)
    if n == 1
      skip(indexes.first).first
    else
      indexes.map { |index| skip(index).first }
    end
  end
end
module Mongoid
  module Finders
    def random(n = 1)
      criteria.random(n)
    end
  end
end
You just have to call User.random(5) and you'll get 5 random users.
It'll also work with filtering, so if you want only registered users you can do User.where(:registered => true).random(5).
This will take a while for large collections, so I recommend an alternate method: take a random sub-range of the count (e.g. 25,000 to 30,000) and randomize within that range.
You can do this by generating a random offset that still leaves room to pick the next n elements without exceeding the collection size. Assume the count is 10 and n is 5:
1. Check whether the given n is less than the total count; if not, set the offset to 0 and skip to step 4.
2. If it is, subtract n from the total count; here that gives 5.
3. Pick a random number between 0 and that difference, say 2, and use it as the offset.
4. Take n (5) users by passing this offset and n as the limit; here you get users 3 through 7.
Code:
>> cnt = User.count
=> 10
>> n = 5
=> 5
>> offset = 0
=> 0
>> if n < cnt
>>   offset = rand(cnt - n)
>> end
=> 2
>> User.skip(offset).limit(n)
and you can put this in a method
def get_random_users(n)
  offset = 0
  cnt = User.count
  if n < cnt
    offset = rand(cnt - n)
  end
  User.skip(offset).limit(n)
end
and call it like
rand_users = get_random_users(5)
hope this helps
Since I want to keep a criteria, I do:
scope :random, -> {
  random_field_for_ordering = fields.keys.sample
  random_direction_to_order = %w(asc desc).sample
  order_by([[random_field_for_ordering, random_direction_to_order]])
}
Just encountered such a problem. Tried
Model.all.sample
and it works for me
The approach from #moox is really interesting, but I doubt that monkeypatching the whole of Mongoid is a good idea here. So my approach is just to write a concern Randomizable that can be included in each model where you use this feature. This goes in app/models/concerns/randomizable.rb:
module Randomizable
  extend ActiveSupport::Concern

  module ClassMethods
    def random(n = 1)
      indexes = (0...count).sort_by { rand }.first(n)
      return skip(indexes.first).first if n == 1

      indexes.map { |index| skip(index).first }
    end
  end
end
Then your User model would look like this:
class User
  include Mongoid::Document
  include Randomizable

  field :name
end
And the tests....
require 'spec_helper'

class RandomizableCollection
  include Mongoid::Document
  include Randomizable

  field :name
end

describe RandomizableCollection do
  before do
    RandomizableCollection.create name: 'Hans Bratwurst'
    RandomizableCollection.create name: 'Werner Salami'
    RandomizableCollection.create name: 'Susi Wienerli'
  end

  it 'returns a random document' do
    srand(2)
    expect(RandomizableCollection.random(1).name).to eq 'Werner Salami'
  end

  it 'returns an array of random documents' do
    srand(1)
    expect(RandomizableCollection.random(2).map(&:name)).to eq ['Susi Wienerli', 'Hans Bratwurst']
  end
end
I think it is better to focus on randomizing the returned result set, so I tried:
Model.all.to_a.shuffle
Hope this helps.
It's a vague question, I know... but the performance of this block of code is horrible. It takes about 15 seconds from submitting the post to the action to rendering the page.
The purpose of this action is to retrieve all Occupations from a CV, plus all the skills from that CV and those occupations. They need to be organized in 2 arrays:
the first array contains all the Occupations (no duplicates), ordered by their score. For each duplicate entry found, the score is increased by 1
the second array contains ALL the skills from both the occupation array and the CV. Again no duplicates are allowed, but for every duplicate encountered the score of the existing entry is increased by one.
Below is the code block that performs this operation. It's relatively big compared to my other code snippets, but I hope it's understandable. I know working with the arrays like I do is confusing, but here is what each array position means:
position 0: the actual skill/occupation object
position 1: the score of the entry
position 2: the location found in the db
position 3: the location found in the cv
def categorize
  @cv = Cv.find(params[:cv_id], :include => [:desired_occupations, :past_occupations, :educational_skills])
  @menu = :second
  @language = Language.resolve(:code => :en, :name => :en)
  @occupation_hashes = []
  @skill_hashes = []

  (@cv.desired_occupations + @cv.past_occupations).each do |occupation|
    section = []
    section << 'Desired occupation' if @cv.desired_occupations.include? occupation
    section << 'Work experience' if @cv.past_occupations.include? occupation

    unless (array = @occupation_hashes.assoc(occupation)).blank?
      array[1] += 1
      array[2] = (array[2] & section).uniq
    else
      @occupation_hashes << [occupation, 1, section]
    end

    occupation.skills.each do |skill|
      unless (array = @skill_hashes.assoc skill).blank?
        label = occupation.concept.label(@language).value
        array[1] += 1
        array[3] << label unless array[3].include? label
      else
        @skill_hashes << [skill, 1, [], [occupation.concept.label(@language).value]]
      end
    end
  end

  @cv.educational_skills.each do |skill|
    unless (array = @skill_hashes.assoc skill).blank?
      array[1] += 1
      array[3] << 'Education skills' unless array[3].include? 'Education skills'
    else
      @skill_hashes << [skill, 1, ['Education skills'], []]
    end
  end

  # Sort the hashes
  @occupation_hashes.sort! { |x, y| y[1] <=> x[1] }
  @skill_hashes.sort! { |x, y| y[1] <=> x[1] }
  @max = @skill_hashes.first[1]
  @min = @skill_hashes.last[1]
end
I can post the additional models and migrations to make it clear what each class does, but I think the first few lines of the above script should be clear on the associations. I'm looking for a way to optimize the each-loops...
That's quite the block of code there. Generally, if you're writing methods that long, you're going to have trouble maintaining them in the future. A technique that would help is breaking up that monolithic chunk of code and turning it into a helper class that does the processing in more logical stages, making it easier to fine-tune each aspect of it.
For instance, an interface might be:
@categorizer = CvCategorizer.new(params[:cv_id])
This would encapsulate all of the above and save it into instance variables made accessible by declaring them with attr_reader.
Using a utility class means you can break up the initialization into steps that are much clearer:
def initialize(cv_id)
  # Call a wrapper method that loads the CV
  @cv = load_cv(cv_id)

  # Perform discrete steps to re-order the imported data
  organize_occupations
  organize_skills
end
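Assembled, the helper class might look like this (a hypothetical skeleton; the two organize_* methods would hold the loops from the question's categorize action):
class CvCategorizer
  attr_reader :cv, :occupation_hashes, :skill_hashes

  def initialize(cv_id)
    @cv = load_cv(cv_id)
    @occupation_hashes = []
    @skill_hashes = []
    organize_occupations
    organize_skills
  end

  private

  def load_cv(cv_id)
    Cv.find(cv_id, :include => [:desired_occupations, :past_occupations, :educational_skills])
  end

  def organize_occupations
    # move the occupation loop from the question here
  end

  def organize_skills
    # move the skill loops from the question here
  end
end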
It's really hard to say why this is slow just by looking at it, though I would pay very close attention to log/development.log to see what's going on in there. It could be that the initial load is painfully slow while the rest of the method is fine.
You should do a bit of profiling in your code to see what is taking the largest chunk of time. You can figure out how to use one of the profilers, or just sprinkle some simple puts or logger.info statements throughout your code with a timestamp. It's probably easiest to do this using Benchmark. Note: you may need to require 'benchmark'; I'm not sure whether Rails auto-requires it.
For a single line, you can do something like this:
logger.info Benchmark.measure { @cv = Cv.find(params[:cv_id], :include => [:desired_occupations, :past_occupations, :educational_skills]) }
And for timing larger blocks of code, assign the measurement first so the do..end block binds to Benchmark.measure rather than logger.info:
timing = Benchmark.measure do
  (@cv.desired_occupations + @cv.past_occupations).each do |occupation|
    section = []
    section << 'Desired occupation' if @cv.desired_occupations.include? occupation
    section << 'Work experience' if @cv.past_occupations.include? occupation

    unless (array = @occupation_hashes.assoc(occupation)).blank?
      array[1] += 1
      array[2] = (array[2] & section).uniq
    else
      @occupation_hashes << [occupation, 1, section]
    end
  end
end
logger.info timing
I'd just start with large blocks and then narrow it down. Not knowing how large of a dataset you are dealing with, it is hard to say what the problem zone is.
I'll also concur with others that you will be way better off breaking this thing into smaller methods. This will also make it easier to test for performance, as you can do things like:
Benchmark.measure { 10000.times { foo.do_that_thing_that_might_be_slow }}
I'm doing this:
@snippets = Snippet.find :all, :conditions => { :user_id => session[:user_id] }

@snippets.each do |snippet|
  snippet.tags.each do |tag|
    @tags.push tag
  end
end
But if a snippet has the same tag two times, it'll push the object twice.
I want to do something like if @tags.in_object(tag) [...]
Would it be possible? Thanks!
I think there are 2 ways to go about it to get a faster result:
1) Add a condition to your find statement (in MySQL, DISTINCT). This will return only unique results; DBs in general do a much better job than regular code at narrowing results.
2) Instead of testing each time with include?, just call uniq after you populate your array.
Here is example code:
ar = []
data = []

# get some random sample data
100.times do
  data << (rand * 10).to_i
end

# populate your result array
# 3 ways to do it:

# 1) you can modify your original array with
data.uniq!

# 2) you can populate another array with your unique data
#    (this doesn't modify your original array)
ar = data.uniq

# 3) you can run a loop if you want to do some sort of additional processing
data.each do |i|
  i = i.to_s + "some text" # do whatever you need here
  ar << i
end
Depending on the situation you may use either.
But running include? on each item in the loop is not the fastest thing, IMHO.
Good luck
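For option 1, the query-side version could look something like this (a sketch in the question's Rails 2 finder style; it assumes a Tag model with a snippets association, which the question doesn't show):
tags = Tag.find :all,
  :select => 'DISTINCT tags.*',
  :joins => :snippets,
  :conditions => ['snippets.user_id = ?', session[:user_id]]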
Another way would be to simply concat the @tags and snippet.tags arrays and then strip the duplicates:
@snippets.each do |snippet|
  @tags.concat(snippet.tags)
end
@tags.uniq!
I'm assuming @tags is an Array instance.
Array#include? tests if an object is already included in an array. This uses the == operator, which in ActiveRecord tests for the same instance or another instance of the same type having the same id.
Alternatively, you may be able to use a Set instead of an Array. This will guarantee that no duplicates get added, but is unordered.
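A minimal sketch of the Set approach (Set is in the standard library; the variable names are mine):
require 'set'

tags = Set.new
@snippets.each { |snippet| tags.merge(snippet.tags) }
# a Set ignores duplicate inserts; call tags.to_a if you need an Array back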
You can probably add a group to the query:
Snippet.find :all, :conditions => { :user_id => session[:user_id] }, :group => "tag.name"
Group will depend on how your tag data works, of course.
Or use uniq:
@tags.concat(snippet.tags.uniq)