Iterating through a table in Ruby using a hash runs slow - ruby-on-rails

I have the following code:
h2.each {|k, v|
  @count += 1
  puts @count
  sq.each do |word|
    if Wordsdoc.find_by_docid(k).tf.include?(word)
      sum += Wordsdoc.find_by_docid(k).tf[word] * @s[word]
    end
  end
  rec_hash[k] = sum
  sum = 0
}
h2 -> is a hash that contains ids of documents; it has more than 1,000 entries
Wordsdoc -> is a model/table in my database...
sq -> is a hash that contains around 10 words
What I'm doing is going through each of the document ids, and for each word in sq I look up in the Wordsdoc table whether the word exists (Wordsdoc.find_by_docid(k).tf.include?(word); here tf is a hash of {word => value}).
If it does, I get the value of that word in Wordsdoc and multiply it by the value of the word in @s, which is also a hash of {word => value}.
This seems to be running very slow; it processes one document per second. Is there a way to process this faster?
Thanks, I really appreciate your help on this!

You do a lot of duplicate querying. While ActiveRecord can do some caching in the background to speed things up, there is a limit to what it can do, and there is no reason to make things harder for it.
The most obvious cause of the slowdown is Wordsdoc.find_by_docid(k). For each value of k, you call it 10 times, and each of those calls can trigger a second one. That means you call that method with the same argument 10-20 times for each entry in h2. Queries to the database are expensive, since the database lives on disk, and accessing the disk is expensive in any system. You can just as easily call Wordsdoc.find_by_docid(k) once, before you enter the sq.each loop, and store the result in a variable - that saves a lot of querying and makes your loop go much faster.
Another optimization - though not nearly as important as the first one - is to get all the Wordsdoc records in a single query. Almost all mid- to high-level (and some of the low-level, too!) programming languages and libraries work better and faster in bulk, and ActiveRecord is no exception. If you query for all the Wordsdoc entries at once, filtered by the docids in h2's keys, you turn 1,000 queries (after the first optimization; before it, 10,000-20,000 queries) into a single, larger query. That lets ActiveRecord and the underlying database retrieve your data in bigger chunks and saves you a lot of disk access.
There are some more minor optimizations you can do, but the two I've specified should be more than enough.
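A minimal sketch of both optimizations combined, assuming Rails 3-style query methods (on Rails 2.x you would use Wordsdoc.all(:conditions => {:docid => h2.keys}) instead of where), and using h2, sq, @s and rec_hash exactly as described in the question:
# Fetch every matching Wordsdoc in one query, keyed by docid,
# instead of calling find_by_docid inside the loops.
docs_by_id = Wordsdoc.where(:docid => h2.keys).index_by(&:docid)

h2.each_key do |k|
  doc = docs_by_id[k]
  next unless doc
  sum = 0
  # sq is treated as a collection of words, as in the original loop.
  sq.each do |word|
    sum += doc.tf[word] * @s[word] if doc.tf.include?(word)
  end
  rec_hash[k] = sum
end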

You're calling Wordsdoc.find_by_docid(k) twice.
You could refactor the code to:
wordsdoc = Wordsdoc.find_by_docid(k)
if wordsdoc.tf.include?(word)
  sum += wordsdoc.tf[word] * @s[word]
end
...but it will still be ugly and inefficient.
You should prefetch all records in batches; see: https://makandracards.com/makandra/1181-use-find_in_batches-to-process-many-records-without-tearing-down-the-server
For example, something like this should be much more efficient:
Wordsdoc.find_in_batches(:conditions => {:docid => array_of_doc_ids}) do |batch|
  batch.each do |wordsdoc|
    if wordsdoc.tf.include?(word)
      sum += wordsdoc.tf[word] * @s[word]
    end
  end
end
You can also retrieve only certain columns from the Wordsdoc table by passing, for example, :select => :tf to find_in_batches.
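On Rails 3 and later the same idea reads more naturally with chained relation methods. A sketch under the same assumptions as the snippet above (word and sum come from the surrounding loop, and array_of_doc_ids would be h2.keys):
# Fetch only the columns needed, in batches, restricted to the docids of interest.
Wordsdoc.where(:docid => array_of_doc_ids).select('docid, tf').find_each do |wordsdoc|
  if wordsdoc.tf.include?(word)
    sum += wordsdoc.tf[word] * @s[word]
  end
end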

As you have a lot going on, I'm just going to offer you a couple of things to check out.
A book called Eloquent Ruby deals with documents and iterating through documents to count the number of times a word was used. All of his examples are about a document system he was maintaining, so it could even tackle other problems for you.
inject is a method that could speed up what you're looking to do for the sum part, maybe.
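For example, a minimal sketch with inject, assuming wordsdoc and @s as in the question above:
# inject folds the per-word products into a single sum without a separate accumulator variable.
sum = sq.inject(0) do |acc, word|
  wordsdoc.tf.include?(word) ? acc + wordsdoc.tf[word] * @s[word] : acc
end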
Delayed Job the whole thing so it runs asynchronously. Meaning, if this is a web app, you must be timing out if you're waiting 1,000 seconds for this job to complete before it shows its answers on the screen.
Go get em.

Related

Randomize Selections in a List of 100

This is a follow-up to this last question I asked: Sort Users by Number of Followers. That code is:
@ordered_users = User.all.sort{|a,b| b.followers.count <=> a.followers.count}
What I hope to accomplish is take the ordered users and get the top 100 of those and then randomly choose 5 out of that 100. Is there a way to accomplish this?
Thanks.
users_in_descending_order_of_followers = User.all.sort_by { |u| -u.followers.count }
sample_of_top = users_in_descending_order_of_followers.take(100).sample(5)
You can use sort_by which can be easier to use than sort, and combine take and sample to get the top 100 users and sample 5 of those users.
User.all.sort can potentially pose problems in the long run, depending on the total number of users and the availability of resources, particularly memory. It is also a lot slower because you call .followers.count twice inside the sort block, which fires two count queries for every comparison the sort makes. This is because User.all.sort immediately executes the User.all query, fetching all User records into memory, as opposed to the usual User.all, which is lazily loaded until you use .each (or better yet .find_each) somewhere down the line.
I suggest something like the following (I extended Deekshith's answer, referring to the link in your other question):
User.joins(:followers).group('users.id').order('count(followers.user_id) desc').limit(100).sample(5)
.joins, .group, .order, and .limit above all build up a single SQL query, which is executed once; .sample(5) (no longer SQL - just a plain Ruby method at this point) then picks 5 of the 100 returned users, yielding the result you need.
I would strongly consider using a counter cache on the User model, to hold the count of followers.
This would give a very small performance impact on adding or removing followers, and greatly increase performance when performing sorts:
User.order(followers_count: :desc)
This would be particularly noticeable if you wanted the top-n users by follower count, or finding users with no followers.
User.order(followers_count: :desc).limit(100).sample(5)
This method will out-perform others using count(*). Add an index on followers_count for best effect.
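A sketch of the counter cache setup, assuming Rails 3.1+ style migrations and a Follower model with a belongs_to :user association (the class and column names are illustrative):
# Migration: add the cached count column, plus an index so sorting on it is cheap.
class AddFollowersCountToUsers < ActiveRecord::Migration
  def change
    add_column :users, :followers_count, :integer, :default => 0, :null => false
    add_index :users, :followers_count
  end
end

# Model: Rails keeps users.followers_count up to date as followers are created or destroyed.
class Follower < ActiveRecord::Base
  belongs_to :user, :counter_cache => true
end

Existing rows can be backfilled once with User.reset_counters(user.id, :followers).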

How to efficiently fetch n most recent rows with GROUP BY in sqlite?

I have a table of event results, and I need to fetch the most recent n events per player for a given list of players.
This is on iOS so it needs to be fast. I've looked at a lot of top-n-per-group solutions that use subqueries or joins, but these run slow for my 100k row dataset even on a macbook pro. So far my dumb solution, since I will only run this with a maximum of 6 players, is to do 6 separate queries. It isn't terribly slow, but there has to be a better way, right? Here's the gist of what I'm doing now:
results_by_pid = {}
player_ids = [1, 2, 3, 4, 5, 6]
n_results = 6
for pid in player_ids:
  results_by_pid[pid] = exec_sql("SELECT *
                                  FROM results
                                  WHERE player_id = #{pid}
                                  ORDER BY event_date DESC
                                  LIMIT #{n_results}")
And then I go on my merry way. But how can I turn this into a single fast query?
There is no better way.
SQL window functions, which might help, are not implemented in SQLite.
SQLite is designed as an embedded database where most of the logic stays in the application.
In contrast to client/server databases where network communication should be avoided, there is no performance disadvantage to mixing SQL commands and program logic.
A less dumb solution requires you to do some SELECT player_id FROM somewhere beforehand, which should be no trouble.
To make the individual queries efficient, ensure you have one index on the two columns player_id and event_date.
This won't be much of an answer, but here goes...
I have found that making things really quick can involve ideas from the nature of the data and schema themselves. For example, searching an ordered list is faster than searching an unordered list, but you have to pay a cost up front - both in design and execution.
So ask yourself if there are any natural partitions on your data that may reduce the number of records SQLite must search. You might ask whether the latest n events fall within a particular time period. Will they all be from the last seven days? The last month? If so then you can construct the query to rule out whole chunks of data before performing more complex searches.
Also, if you just can't get the thing to work quickly, you can consider UX trickery! Soooooo many engineers don't get clever with their UX. Will your query be run as the result of a view controller push? Then set the thing going in a background thread from the PREVIOUS view controller, and let it work while iOS animates. How long does a push animation take? .2 seconds? At what point does your user indicate to the app (via some UX control) which playerids are going to be queried? As soon as he touches that button or TVCell, you can prefetch some data. So if the total work you have to do is O(n log n), that means you can probably break it up into O(n) and O(log n) pieces.
Just some thoughts while I avoid doing my own hard work.
More thoughts
How about a separate table that contains the ids of the previous n inserts? You could add a trigger to delete old ids if the size of the table grows above n. Say:
CREATE TABLE IF NOT EXISTS recent_results
(result_id INTEGER PRIMARY KEY, event_date DATE);
-- DATE isn't a strict SQLite storage type, but SQLite accepts it; you get the point.

CREATE TRIGGER IF NOT EXISTS optimizer
AFTER INSERT ON recent_results
WHEN (SELECT COUNT(*) FROM recent_results) > N
BEGIN
  DELETE FROM recent_results
  WHERE result_id = (SELECT result_id
                     FROM recent_results
                     ORDER BY event_date ASC
                     LIMIT 1);
END;
-- Or something like that. I have no idea if this will work; I just threw it together.
Or you could just create a temporary memory-based table that you populate at app load and keep up to date as you perform transactions during app execution. That way you only pay the steep price once!
Just a few more thoughts for you. Be creative, and remember that you can usually define what you want as a data structure as well as an algorithm. Good luck!

How to improve an ActiveRecord query

I have a method which returns the results from a query. The code that calls the method then loops through each result and launches a sidekiq worker. The problem I'm running into is that the looping is actually taking a decent amount of time (almost the same amount of time it takes to run all the workers). Here's the query:
Object.where("last_updated > ?" , 1.days.ago.midnight )
I then do the following:
objects.each { |o| o.perform_async(something) }
I'm trying to figure out how to make this process more efficient. The result is that it's taking around 10 minutes for this process to complete, effectively taking 20 milliseconds per launch (if the query returns 30,000 results). Is there any way to make this quicker?
I see that you already have last_updated indexed. Next:
Object.select('id, only_columns_you_need').where(...).find_each do |object|
  object.perform_async(something)
end
If the "objects" table has lots of columns but you only need a few for this operation, selecting only those columns can really speed things up in both db and Ruby land.
find_each will load the records in batches of 1000 by default. Use the :batch_size option to tweak that.
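For example, a sketch combining both suggestions with the condition from the question (the column list is illustrative):
# Pull only the needed columns, in batches of 500 instead of the default 1000.
Object.select('id, only_columns_you_need')
      .where("last_updated > ?", 1.days.ago.midnight)
      .find_each(:batch_size => 500) do |object|
  object.perform_async(something)
end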
UPDATE
def do_stuff_to_objects(&stuff)
  Object.select('id, only_columns_you_need').where(...).find_each(&stuff)
end
...
do_stuff_to_objects do |object|
  object.perform_async(something)
end

ActiveRecord query much slower than straight SQL?

I've been working on optimizing my project's DB calls and I noticed a "significant" difference in performance between the two identical calls below:
connection = ActiveRecord::Base.connection()
pgresult = connection.execute(
  "SELECT SUM(my_column)
   FROM table
   WHERE id = #{id}
   AND created_at BETWEEN '#{lower}' and '#{upper}'")
and the second version:
sum = Table.
  where(:id => id, :created_at => lower..upper).
  sum(:my_column)
The method using the first version takes on average 300ms to execute (the operation is called a couple of thousand times in total within it), while the method using the second version takes about 550ms. That's nearly a 100% increase in execution time.
I double-checked the SQL generated by the second version; it's identical to the first, except that it prefixes columns with the table name.
Why the slowdown? Is the conversion between ActiveRecord and SQL really making the operation take almost twice as long?
Do I need to stick to writing straight SQL (perhaps even a sproc) if I need to perform the same operation a ton of times and I don't want to hit the overhead?
Thanks!
A couple of things jump out.
Firstly, if this code is being called 2000 times and takes 250ms extra to run, that's ~0.125ms per call to convert the Arel to SQL, which isn't unrealistic.
Secondly, I'm not sure of the internals of Range in Ruby, but lower..upper may be doing calculations such as the size of the range and other things, which will be a big performance hit.
Do you see the same performance hit with the following?
sum = Table.
  where(:id => id).
  where("created_at BETWEEN ? AND ?", lower, upper).
  sum(:my_column)
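One way to see where the time goes is to benchmark both forms in isolation, for example with Ruby's Benchmark module (a sketch; id, lower, and upper are assumed to be defined as in the question, and "table"/Table are the question's placeholder names):
require 'benchmark'

Benchmark.bm(15) do |x|
  x.report('raw SQL') do
    2000.times do
      ActiveRecord::Base.connection.select_value(
        "SELECT SUM(my_column) FROM table
         WHERE id = #{id} AND created_at BETWEEN '#{lower}' AND '#{upper}'")
    end
  end
  x.report('ActiveRecord') do
    2000.times do
      Table.where(:id => id)
           .where("created_at BETWEEN ? AND ?", lower, upper)
           .sum(:my_column)
    end
  end
end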

Ruby on Rails: Model.all.each vs find_by_sql("SELECT * FROM model").each?

I'm fairly new to RoR. In my controller, I'm iterating over every tuple in the database. For every table and every column, I used to call
SomeOtherModel.find_by_sql("SELECT column FROM model").each {|x| #etc }
which worked fine enough. When I later changed this to
Model.all(:select => "column").each {|x| #etc }
the loop starts out at roughly the same speed but quickly slows down to something like 100 times slower than the find_by_sql command. These calls should be identical, so I really don't know what's happening.
I know these calls are not the most efficient but this is just an intermediate step and I will optimize it more once this works correctly.
So to clarify: Why in the world does calling Model.all.each run so much slower than using find_by_sql.each?
Thanks!
Both calls make the same SQL query, so they should be roughly the same speed. Model.all does go through an extra level of indirection: I think that with all, Rails collects all the rows from the DB connection, then passes that collection to another method that loops over it and creates the model objects, whereas find_by_sql does it all in one method, which may be a little faster.
How large is your data set? If the set is over 100 or so records you shouldn't use all; use find_in_batches instead. It uses limits and offsets to run your code in batches rather than loading the entire table into memory at once. You do need to include the primary key in the select, since it uses that to do the ordering and limiting. Example:
Model.find_in_batches(:batch_size => 100) do |group|
  group.each { |item| item.do_something_interesting }
end
ScottD is correct. The find_by_sql call does not create the slowdown you are seeing. The speed issues are due to the fact that Model.all loads a model instance into memory for every single record fetched. This can result in complete failure with a large enough result set. FYI, this is covered very well in the Rails Guides here: http://guides.rubyonrails.org/active_record_querying.html#retrieving-multiple-objects
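If you don't need to work with the batches themselves, find_each is the record-by-record shorthand for the same batched fetching (a sketch under the same assumptions as the snippet above):
# Fetches rows in batches of 100 under the hood, yielding one record at a time.
Model.find_each(:batch_size => 100) do |item|
  item.do_something_interesting
end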
