Randomize Selections in a List of 100 - ruby-on-rails

This is a follow-up to this last question I asked: Sort Users by Number of Followers. That code is:
@ordered_users = User.all.sort{|a,b| b.followers.count <=> a.followers.count}
What I hope to accomplish is take the ordered users and get the top 100 of those and then randomly choose 5 out of that 100. Is there a way to accomplish this?
Thanks.

users_in_descending_order_of_followers = User.all.sort_by { |u| -u.followers.count }
sample_of_top = users_in_descending_order_of_followers.take(100).sample(5)
You can use sort_by, which is often simpler than sort, and combine take and sample to get the top 100 users and randomly pick 5 of them.

User.all.sort can "potentially" pose some problems in the long-run, depending on the number of total users, and the availability of resources particularly computer memory, not to mention it would be a lot slower because you're calling 2x .followers.count inside the sort block, which essentially calls 2xN times more DB query; N being the number of users. This is because User.all.sort will immediately execute the User.all query, thereby fetching all User records into memory, as opposed to your usual User.all, which is lazy loaded, until you (for example use .each, or better yet .find_each somewhere down the line)
I suggest something like below (I extended Deekshith's answer referring to your link to the other question):
User.joins(:followers).group('users.id').order('count(followers.user_id) desc').limit(100).sample(5)
The chained .joins, .group, .order, and .limit calls above all build up a single SQL query, which is executed once; .sample(5) is then no longer SQL but a plain Ruby method, run on the 100 records returned, finally yielding the result you need.

I would strongly consider using a counter cache on the User model, to hold the count of followers.
This adds a very small cost when adding or removing followers, and greatly increases performance when sorting:
User.order(followers_count: :desc)
This would be particularly noticeable if you wanted the top-n users by follower count, or finding users with no followers.
User.order(followers_count: :desc).limit(100).sample(5)
This method will out-perform others using count(*). Add an index on followers_count for best effect.
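For reference, a minimal sketch of that setup, assuming the followers association is backed by a Follower model with a user_id column (all names here are assumptions):
# Migration: add the cache column, with the index suggested above
class AddFollowersCountToUsers < ActiveRecord::Migration
  def change
    add_column :users, :followers_count, :integer, default: 0, null: false
    add_index :users, :followers_count
  end
end
# Model: counter_cache keeps users.followers_count up to date automatically
class Follower < ActiveRecord::Base
  belongs_to :user, counter_cache: true
end
Existing rows can be backfilled with User.reset_counters(user.id, :followers) after the migration runs.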

How can I optimise this method in Ruby using preload, includes, or eager_load?

I want to reduce allocations and speed up a Ruby worker. I've been reading about eager loading, but I don't fully understand it yet. Here's the method:
def perform(study_id, timestamp)
  study = Study.includes(:questions, :participants).find(study_id)
  questions = study.questions.not_random.not_paused
  participants = study.participants
  return unless questions && participants
  end_timestamp = timestamp_window(timestamp)
  participants.each do |participant|
    process_participant(participant, questions, timestamp, end_timestamp, study)
  end
end
I was hoping that Study.includes() would reduce the number of database queries, but looking at Skylight, it doesn't seem to have changed anything.
Am I using includes incorrectly, or should I be using something else?
The example you've given doesn't seem like it would benefit much from eager loading. Eager loading's purpose is to avoid N+1 queries; something like this:
User.first(100).each do |user|
  comments = user.comments
end
This will make 1 query for the 100 users, and 100 queries for the comments. That's why it's called N+1 (N being 100 here).
To prevent this from happening, you'd use eager loading:
User.includes(:comments).first(100).each do |user|
  comments = user.comments
end
Now it makes two queries - one for the users and one for the comments. The fact that it makes 2 queries instead of 1 isn't a problem. Part of optimization (big O) is to find bottlenecks at different 'scales'. I'm not going to explain all that, but this is a good tutorial: https://samurails.com/interview/big-o-notation-complexity-ruby/
In the example without eager loading, the query count is O(N) - linear. The number of queries grows linearly with N. With eager loading, though, you can increase N without adding queries: the query count is O(1) - constant.
In your case, you have a method that makes three queries:
Study (find one)
associated questions
associated participants
An easy way to determine whether you should use eager loading is to check your code for SQL fetching that happens inside a loop. That isn't happening here, so eager loading won't do much. It would be useful, for example, if you were fetching associated data for a list of studies.
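For instance, a sketch of that scenario (the active column is hypothetical):
Study.includes(:questions, :participants).where(active: true).each do |study|
  study.questions.size    # already in memory, no extra query
  study.participants.size # already in memory, no extra query
end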
It might technically be possible to write a single SQL query that fetches all three tables' data in one request, but I don't think ActiveRecord has anything to do it for you. It's probably unnecessary, though. If you're not convinced, you can try writing that SQL yourself and report on the performance gains.

.map(&:dup) Calculations Slow

I have an ActiveRecord query user.loans, and am using user.loans.map(&:dup) to duplicate the result. This is so that I can loop through each Loan (100+ times) and run several calculations.
These calculations take several seconds longer than when I run them directly on user.loans or user.loans.dup. If I do that, however, all queries on user.loans are affected, even when querying with different methods.
Is there an alternative to .map(&:dup) that can achieve the same result with faster calculations? I'd like to preserve the relations so that I can retrieve associated records to each Loan.
The fastest way to achieve what you want is to make the calculations directly in ActiveRecord; that way you don't have to loop through the resulting Array at all.
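For example, a quick sketch (value here is an assumed numeric column on loans):
user.loans.sum(:value)     # a single aggregate query, no Ruby loop at all
user.loans.average(:value) # the same idea works for other aggregates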
If you still want to loop over the elements, you probably shouldn't use map(&:dup) to duplicate each one. You could use each instead and collect the results into a new array, leaving the original elements untouched. Here is what I think you should do:
def calculate_loans
  calculated_loans = Array.new
  user.loans.each do |loan|
    # Make your calculations here without mutating the loan. For example:
    calculated_loans.push(loan.value + 10)
  end
  calculated_loans
end
This way you keep the original user.loans elements untouched and get a separate calculated_loans array with the results.
Please let me know if this improves your performance :)
To resolve conflicts with other calls to user.loans, I wound up using user.loans.reload in the Presenter I have for this particular view. This way I was able to continue making calculations directly on ActiveRecord elsewhere (per Daniel Batalla's suggestion), but without the conflicts I mentioned in my original question.

iterating through table in Ruby using hash runs slow

I have the following code:
h2.each do |k, v|
  @count += 1
  puts @count
  sq.each do |word|
    if Wordsdoc.find_by_docid(k).tf.include?(word)
      sum += Wordsdoc.find_by_docid(k).tf[word] * @s[word]
    end
  end
  rec_hash[k] = sum
  sum = 0
end
h2 -> is a hash that contains the ids of documents; it has more than 1000 entries
Wordsdoc -> is a model/table in my database...
sq -> is a hash that contain around 10 words
What I'm doing is going through each of the document ids, and for each word in sq I check in the Wordsdoc table whether the word exists (Wordsdoc.find_by_docid(k).tf.include?(word); here tf is a hash of {word => value}),
and if it does, I get the value of that word from Wordsdoc and multiply it by the value of the word in @s, which is also a hash of {word => value}.
This runs very slowly - it processes about one document per second. Is there a way to make it faster?
thanks really appreciate your help on this!
You do a lot of duplicate querying. While ActiveRecord can do some caching in the background to speed things up, there is a limit to what it can do, and there is no reason to make things harder for it.
The most obvious cause of the slowdown is Wordsdoc.find_by_docid(k). For each value of k you call it 10 times, and each of those calls may trigger a second one (once in the condition, once in the body), so the method runs 10-20 times with the same argument for every entry in h2. Queries to the database are expensive, since the database lives on disk, and disk access is expensive in any system. You can just as easily call Wordsdoc.find_by_docid(k) once, before entering the sq.each loop, and store the result in a variable - that saves a lot of querying and makes the loop much faster.
Another optimization - though not nearly as important as the first one - is to fetch all the Wordsdoc records in a single query. Almost all mid-to-high-level (and some low-level!) languages and libraries work better and faster in bulk, and ActiveRecord is no exception. If you query for all Wordsdoc entries whose docid is among h2's keys, you can turn 1000 queries (10,000-20,000 before the first optimization) into one big query. That lets ActiveRecord and the underlying database retrieve your data in bigger chunks and saves a lot of disk access.
There are a few more minor optimizations you could make, but the two above should be more than enough.
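Putting both optimizations together, a sketch (assuming a Rails 3-style where; docid, tf, sq, and @s are from your original code):
docs_by_id = Wordsdoc.where(:docid => h2.keys).index_by(&:docid) # one bulk query
h2.each do |k, v|
  doc = docs_by_id[k]
  sum = 0
  sq.each do |word|
    sum += doc.tf[word] * @s[word] if doc.tf.include?(word)
  end
  rec_hash[k] = sum
end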
You're calling Wordsdoc.find_by_docid(k) twice.
You could refactor the code to:
wordsdoc = Wordsdoc.find_by_docid(k)
if wordsdoc.tf.include?(word)
  sum += wordsdoc.tf[word] * @s[word]
end
...but it will still be ugly and inefficient.
You should prefetch all records in batches, see: https://makandracards.com/makandra/1181-use-find_in_batches-to-process-many-records-without-tearing-down-the-server
For example, something like this should be much more efficient:
Wordsdoc.find_in_batches(:conditions => {:docid => array_of_doc_ids}) do |batch|
  batch.each do |wordsdoc|
    if wordsdoc.tf.include?(word)
      sum += wordsdoc.tf[word] * @s[word]
    end
  end
end
You can also retrieve only certain columns from the Wordsdoc table, for example with :select => :tf in the find_in_batches call.
As you have a lot going on, I'm just going to offer up a few things to check out.
A book called Eloquent Ruby deals with documents and iterating through documents to count the number of times a word was used. All its examples are about a document system the author was maintaining, so it could even tackle other problems for you.
inject is a method that could speed up what you're looking to do for the sum part, maybe - see the sketch after this list.
Use Delayed Job to run the whole thing asynchronously. If this is a web app, you must be timing out while waiting 1000 seconds for this job to complete before it shows its answers on the screen.
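A sketch of the inject idea, where doc stands for the prefetched Wordsdoc record:
sum = sq.inject(0) do |acc, word|
  doc.tf.include?(word) ? acc + doc.tf[word] * @s[word] : acc
end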
Go get em.

Search a relation without a second query

My question is about how to perform varying levels of search into a database while limiting the number of queries.
Let's start simple:
@companies = Company.where("active = ?", true)
Let's say we display records from this set. Then, we need:
@clientcompanies = @companies.where("client_id = ?", @client.id)
We display something from @clientcompanies. Then, we want to drill down further.
@searchcompanies = @clientcompanies.where("name LIKE ? OR notes LIKE ?", "#{params[:search]}%", "#{params[:search]}%")
Are these three statements the most efficient way to go about this?
If indeed the database is starting with the entire Company table each time around, is there a way to limit the scope so each of the above statements would take a shorter amount of time as the size of the set diminishes?
In case it matters, I'm running Rails 3 on both MySQL and PostgreSQL.
It doesn't get much more optimized than what you're already doing. Exactly zero of those statements will execute a SQL query until you try to iterate over the results. The query executes when you call methods like all, first, inspect, any?, or each.
Each time you chain on a new where or another Arel method, it extends the SQL query that will eventually be executed. If, somewhere in the middle, you want to see the query that would run, you can do puts @searchcompanies.to_sql
Note that if you run these statements in the console, each one appears to run a SQL query only because the console automatically calls .inspect on the result of the line you entered.
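To make the laziness concrete, a quick sketch using the relations from your question:
@companies = Company.where("active = ?", true) # builds a relation, no query yet
@clientcompanies = @companies.where("client_id = ?", @client.id) # still no query
puts @clientcompanies.to_sql # prints the single combined SELECT
@clientcompanies.to_a        # the query finally runs here, once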
Hopefully I answered your question :)
There's a great railscast here: http://railscasts.com/episodes/239-activerecord-relation-walkthrough that explains how ActiveRelation works, and what you can do with it.
EDIT:
I may have misunderstood your question. You indicated that after each where call you display information from the query. What's the use case for this? Are you displaying all companies on the same page as the companies filtered by a search? If you display something from that very first query, you will pull every single company row from your database, which will not scale or perform well as the number of company entries grows.
Would it not make sense to only display information from the @searchcompanies variable?

Does Ruby on Rails "has_many" array provide data on a "need to know" basis?

On Ruby on Rails, say the Actor model object is Tom Hanks, and its "has_many" fans association holds 20,000 Fan objects. Then
actor.fans
gives an Array with 20,000 elements. Presumably the elements are not pre-populated with values? Otherwise, fetching all those Fan objects from the DB could be extremely time-consuming.
So it is on a "need to know" basis?
So does it pull data only when I access actor.fans[500], and again when I access actor.fans[0]? If it jumps from record to record, it can't optimize performance with sequential reads, which can be faster on a hard disk because the records may live in nearby sectors. For example, if the program touches 2 random elements, it's faster to read just those 2 records; but if it touches all elements in random order, it may be faster to read all the records sequentially and then process them in random order. How would RoR know whether I'm touching only a few random elements or all of them?
Why would you want to fetch 20,000 records if you only use 2 of them? Fetch only those two from the DB. If you want to list the fans, you will probably use pagination - i.e. use limit and offset in your query, or a pagination gem like will_paginate.
I see no logical explanation for why you should go the way you're trying to. Describe the real situation so we can help you.
However, there is one thing you need to know while loading many associated objects from the DB - use :include, like
Actor.all(:include => :fans)
This will eager-load all the fans, so there will be only 2 queries instead of N+1, where N is the number of actors.
Look at the SQL which is spewed out by the server in development mode, and that will tell you how many fan records are being loaded. In this case actor.fans will indeed cause them all to be loaded, which is probably not what you want.
You have several options:
Use a paginator, as suggested by Tadas;
Set up another association with the fans table that pulls in just the ones you're interested in. This can be done with a :conditions option on the has_many statement, e.g.
has_many :fans, :conditions => "country_of_residence = 'UK'"
Specify the full SQL to narrow down the rows returned, with the :finder_sql option;
Specify the :limit option, which will, well, limit the number of rows returned.
It all depends on what you want to do.
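As a sketch of the second option using a separate, narrower association (Rails 2/3-era syntax; the column name is assumed):
class Actor < ActiveRecord::Base
  has_many :fans
  has_many :uk_fans, :class_name => "Fan",
                     :conditions => "country_of_residence = 'UK'",
                     :limit => 100
end
actor.uk_fans # loads at most 100 matching fans in one query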
