How to reduce hitting the db multiple times - ruby-on-rails

Let's assume I have a hundred thousand users.
A simple example:
user = User.where(id: 1..10000)
User Load (30.8ms) SELECT `users`.* FROM `users` WHERE (`users`.`id` BETWEEN 1 AND 10000)
Here, I want to slice it further, like this:
user.where(id: 100..1000)
User Load (2.9ms) SELECT `users`.* FROM `users` WHERE (`users`.`id` BETWEEN 1 AND 10000) AND (`users`.`id` BETWEEN 100 AND 1000)
Why is ActiveRecord hitting the db twice? It already has a result containing the larger set. Why does it have to hit the db again instead of just reusing and slicing the existing ActiveRecord::Relation?
Is there any good solution for this?

ActiveRecord keeps track of queries and is able to cache certain duplicate requests, but in this case it is not trivial for a library to recognize that the second query is a subset of the first.
Moreover, there are several reasons why a generic library such as ActiveRecord may not want to implement a caching logic like that one. Caching a large data set in a very large application can consume several MB of memory, and processes may hit the memory limit of the machine fairly quickly because the garbage collector would not be able to reclaim it.
Long story short, it's a very bad idea to implement such a feature in a generic ORM library.
If you want to implement it in your own code, you are free to do it.
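For instance, a minimal sketch of doing the slicing yourself (load once, then filter the array in Ruby; variable names are just illustrative):
users = User.where(id: 1..10000).to_a   # one query; to_a materializes the relation

# Slice the already-loaded result in memory instead of issuing a second query.
subset = users.select { |u| (100..1000).cover?(u.id) }
Keep in mind this trades a second database round trip for holding the whole set in memory, which is exactly the trade-off described above.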

ActiveRecord is hitting the db twice because you are running it in the console, which invokes the query on each line through .inspect. If this were run inside a block of application code, execution would be deferred until you actually access user.
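A rough illustration of that lazy behaviour (nothing runs until the relation is actually used):
relation = User.where(id: 1..10000)       # no query yet, just an ActiveRecord::Relation
narrowed = relation.where(id: 100..1000)  # still no query; the conditions are merged

narrowed.each { |u| puts u.id }           # the single combined query runs here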

Instead of making two passes, do it in a single query:
User.where("id between ? and ?", 100, 1000)
That will reduce the db hits. Hope this answers your question.
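As a side note, the same condition can also be expressed with a Range, which produces the equivalent BETWEEN query seen in the logs above:
User.where(id: 100..1000)
# SELECT `users`.* FROM `users` WHERE (`users`.`id` BETWEEN 100 AND 1000)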

Related

Improving an Active Record / Postgresql Query Further

Following up on my question here, I'm trying to improve a search further. We first search a replays table (searching 2k records), then get the unique players associated with those replays (10 per replay, so 20k records), and render JSON. This is done through the controller; the search reads as:
def index
  @replays = Replay.includes(:players).where(map_id: params['map_id'].to_i).order(id: :desc).limit(2000)
  render json: @replays[0..2000].to_json(include: [:players])
end
The performance:
Completed 200 OK in 254032ms (Views: 34.1ms | ActiveRecord: 20682.4ms)
The actual Active Record search reads as:
Replay Load (80.4ms) SELECT "replays".* FROM "replays" WHERE "replays"."map_id" = $1 ORDER BY "replays"."id" DESC LIMIT $2 [["map_id", 1], ["LIMIT", 2000]]
Player Load (20602.0ms) SELECT "players".* FROM "players" WHERE "players"."replay_id" IN (117217...
This mostly works, but still takes an exceptional amount of time. Is there a way to improve performance?
You're getting bitten by this issue https://postgres.cz/wiki/PostgreSQL_SQL_Tricks_I#Predicate_IN_optimalization
There is a note on pg_performance about an optimization possibility for the IN predicate when the list of values is longer than about eighty numbers. For longer lists it is better to create a constant subquery using multiple VALUES:
SELECT *
FROM tab
WHERE x IN (1,2,3,..n); -- n > 70
-- faster case
SELECT *
FROM tab
WHERE x IN (VALUES(10),(20));
Using VALUES is faster for a larger number of items, so don't use it for small sets of values.
Basically, SELECT * FROM tab WHERE x IN (1, 2, ...) with a long list of values is very slow. It's ridiculously faster if you can convert it to a list of VALUES, like SELECT * FROM tab WHERE x IN (VALUES (1), (2), ...).
Unfortunately, since this is happening inside Active Record, it's a little tricky to exercise control over the query. You can avoid the includes call and manually construct the SQL to load all your child records, then build up the associations yourself.
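A rough sketch of that first option, using the Replay/Player models from the question. Note this leans on Active Record association internals (association, target, loaded!), which can differ between Rails versions, so treat it as a sketch rather than a drop-in solution:
replays = Replay.where(map_id: params['map_id'].to_i)
                .order(id: :desc)
                .limit(2000)
                .to_a

# Load all children in one hand-written query using the VALUES form...
values_list = replays.map { |r| "(#{r.id.to_i})" }.join(",")
players = Player.where("replay_id IN (VALUES #{values_list})").to_a

# ...then group them and attach them to the parents without further queries.
players_by_replay = players.group_by(&:replay_id)
replays.each do |replay|
  assoc = replay.association(:players)
  assoc.target = players_by_replay[replay.id] || []
  assoc.loaded!
end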
Alternatively, you can monkey patch Active Record. Here's what I've done on Rails 4.2, in an initializer:
module PreloaderPerformance
  private

  def query_scope(ids)
    if ids.count > 100
      type = klass.columns_hash[association_key_name.to_s].sql_type

      values_list = ids.map do |id|
        if id.kind_of?(Integer)
          " (#{id})"
        elsif type == "uuid"
          " ('#{id.to_s}'::uuid)"
        else
          " ('#{id.to_s}')"
        end
      end.join(",")

      scope.where("#{association_key_name} in (VALUES #{values_list})")
    else
      super
    end
  end
end

module ActiveRecord
  module Associations
    class Preloader
      class Association #:nodoc:
        prepend PreloaderPerformance
      end
    end
  end
end
Doing this I've seen a 50x speed up of some of my queries, with no issues as of yet. Note it's not fully battle tested, and I bet it will have some issues if your association uses an unusual data type for the foreign key relationship. In my database, I only use uuids or integers for our associations. The usual caveats about monkey patching core Rails behavior apply.
I know find_each can be used to batch queries, which might lighten the memory load here. Could you try out the following and see how it impacts the time?
Replay.where(map_id: params['map_id'].to_i).includes(:players).find_each(batch_size: 100).map do |replay|
  replay.to_json(include: :players)
end
I'm not sure this will work. It might be that the mapping negates the benefits of batching - there are certainly more queries, but it will use less memory as it doesn't need to hold more than 20k records at a time.
Have a play and see how it looks - fiddle with the batch size too, see how that affects things.
There's a caveat in that you can't apply a limit, so bear that in mind.
I'm sure someone else'll come up with a far slicker solution, but hope this might help in the meantime. If it's awful when you check the speed, let me know and I'll delete this answer :)

ActiveRecord querying a tree structure efficiently

I have inherited a Rails 3 app that stores much of its data as a fairly sophisticated tree structure. The application works pretty well in general, but we are noticing some problems with performance, mostly around database interactions.
I am seeing a lot of queries along the lines of these showing up in my logs:
SELECT `messages`.* FROM `messages` WHERE `messages`.`context_type` = 'Node' AND `messages`.`context_id` IN (153740, 153741, /* about a thousand records... */ 154837, 154838, 154839, 154840, 154841, 154842, 154843)
Followed by many single queries where it looks as though the same record is being queried time and again:
CACHE (0.0ms) SELECT `plans`.* FROM `plans` WHERE `plans`.`type` IN ('Scenario') AND `plans`.`id` = 1435 LIMIT 1
My log has that exact query roughly eighty times. Now, I'm guessing that the initial CACHE message means it is probably being pulled from a cache rather than going back to the database every time, but it still looks like a lot, and this type of thing is happening repeatedly.
I am guessing that the above queries are an association being pulled out backwards so that message belongs_to plan and it is loading all the messages then pulling out the plan for each one rather than, as one might do in a sane world, starting with the plan and then loading all the messages.
Working in this vein, a single request contains 1641 SELECT statements, and it seems very likely to me that the sheer amount of database traffic (not to mention the number of sequential LIMIT 1 queries for neighbouring data in the same table) is a significant bottleneck. I am reluctant to post too much code, but this is a typical example of one of the larger queries:
def nodes
  include_objects = [:sector, :market, :messages, :node_user_prefs, :reference_node, :project,
                     { :available_events => :available_event_nodes }, :customer_factors,
                     :active_components, :tags, { :event_histories => :node },
                     { :event_histories => :user }]
  project = self
  @cached_nodes ||= begin
    all_nodes = orig_nodes.includes(include_objects)
    all_nodes = all_nodes.map { |n| n.tap { |i| i.cached_project = project } }
    all_node_ids = all_nodes.map(&:id)
    all_nodes.select { |n| n.type == 'InitialNode' || all_node_ids.include?(n.parent_node_id) }
  end
end
Obviously, the queries are pretty diverse and the application is large, but this is fairly representative of the standard approach taken.
What are the easy wins with ActiveRecord that I can use to try and speed things up? I can easily enough put together a query that would pull all the required data out in a single round trip, but how easy would it be to form that - redundancies and all - into my model hierarchy? What other techniques can I use to help out here?
Ancestry Gem
Not a direct fix by any means, but you may wish to consider the ancestry gem.
This will give you a way to create a tree structure whereby you'll be able to load single records and then have their descendants fetched as you wish. This will cut back on your SQL queries.
If you set up your nodes / objects in this fashion, it will allow you to call the records you require once, and ancestry will populate the rest. If you want me to divulge more information on this, please let me know in the comments and I'll detail more specifics.
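A minimal sketch of the setup (assuming a nodes table; the migration name is illustrative):
# Gemfile
gem 'ancestry'

# Migration: ancestry stores each node's path to the root in a string column.
class AddAncestryToNodes < ActiveRecord::Migration
  def change
    add_column :nodes, :ancestry, :string
    add_index :nodes, :ancestry
  end
end

class Node < ActiveRecord::Base
  has_ancestry
end

# With that in place, whole branches come back in a single query:
root = Node.roots.first
root.children      # direct children
root.descendants   # everything below root
root.subtree       # root plus all descendants
root.ancestors     # the path back up to the root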

Need advice: how to handle huge data to summarise a report in PHP

I am looking for advice to handle following situation.
I have a report which shows a list of products; each product has the number of times it has been viewed and the number of times an order has been requested for it.
Looking into the DB, I feel it's not good. There are three tables participating:
product
product_view
order_item
The following SELECT query is executed
SELECT product_title,
       (SELECT COUNT(views) FROM product_view pv WHERE p.pid = pv.pid) AS product_view,
       (SELECT COUNT(placed) FROM order_item o WHERE p.pid = o.pid) AS product_request_count
FROM product p
ORDER BY product_title
LIMIT 0, 10
This query returns 10 records successfully; however, it is very slow to load. Also, when the user uses the export functionality, approximately 2,000,000 records would be returned, and I get a memory exhaustion error.
I am not able to find the most suitable solution for this in ZF2 [PHP + MySQL].
Can someone suggest a good strategy to deal with this?
How about using background processes? It doesn't have to be purely ZF2.
Once the background process is done, the system can notify the user via email that the export is done. :)
You can:
call set_time_limit(0) to lift the execution time limit, and
loop through the whole result set in chunks of, say, 1000 records, outputting the result to the user sequentially.

How to randomize (and paginate) large set of results?

I am creating a contest application that requires the main index page of entries to be randomized. As it will potentially be a large set of entries (maybe up to 5000), I will also need to paginate them.
Here are the challenges:
I have read that using a database's 'random()' function on a large set can perform poorly.
I would like for things to not be re-randomized when the pagination links are clicked. In other words, it should return a random set upon first load and then keep the same order while someone uses the pagination.
The second challenge seems potentially unrealistic, but perhaps there are some creative solutions out there?
Thanks for any input.
A simple way I suggest is writing your own random function in the SQL query; the more complicated the function, the more random the result. For example:
You already know:
select * from your_table order by rand() limit 0, 10
Assume your_table has a primary key "id"; now replace "rand()" with "MOD(id, 13)":
select * from your_table order by MOD(id, 13) limit 0,10
If your_table has a datetime column, the result will be better; try this query:
select * from your_table order by MOD(id, 13), updated_at limit 0,10
Also, if you think that's still not random enough, here is one I bet you'll love:
select * from your_table order by MD5(id) limit 0, 10
I would just use a random number generator to select IDs, and store the seed in the session so a user will see the same ordering while paginating. I would probably also use a hash to make sure each ID is picked only once.
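A minimal sketch of the seed-in-session idea, assuming a Rails controller on MySQL (the Entry model, per-page size, and page param are hypothetical):
class EntriesController < ApplicationController
  PER_PAGE = 25

  def index
    # Pick a seed once per session, so every pagination click reuses the
    # same deterministic "random" ordering.
    seed = (session[:shuffle_seed] ||= rand(1_000_000))

    page = params[:page].to_i
    page = 1 if page < 1

    # MySQL's RAND(N) is repeatable for a given seed, so the order stays
    # stable across pages as long as the seed does not change.
    @entries = Entry.order("RAND(#{seed.to_i})")
                    .limit(PER_PAGE)
                    .offset((page - 1) * PER_PAGE)
  end
end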

LINQ to SQL Pagination and COUNT(*)

I'm using the PagedList class in my web application that many of you might be familiar with if you have been doing anything with ASP.NET MVC and LINQ to SQL. It has been blogged about by Rob Conery, and a similar incarnation was included in things like Nerd Dinner, etc. It works great, but my DBA has raised concerns about potential future performance problems.
His issue is around the SELECT COUNT(*) that gets issued as a result of this line:
TotalCount = source.Count();
Any action that has paged data will fire off an additional query (like below) as a result of the IQueryable.Count() method call:
SELECT COUNT(*) AS [value] FROM [dbo].[Products] AS [t0]
Is there a better way to handle this? I considered using the Count property of the PagedList class to get the item count, but realized that this won't work because it's only counting the number of items currently displayed (not the total count).
How much of a performance hit will this cause to my application when there's a lot of data in the database?
IIRC this stuff is part of the index stats and should be very efficient; you should ask your DBA to substantiate his concerns rather than prematurely optimising.
Actually, this is a pretty common issue with Linq.
Yes, index stats will get used if the statement is only SELECT COUNT(*) AS [value] FROM [dbo].[Products] AS [t0], but 99% of the time it's going to contain WHERE clauses as well.
So basically two SQL statements are executed:
SELECT COUNT(*) AS [value] FROM [dbo].[Products] AS [t0] WHERE blah=blah and someint=500
SELECT blah, someint FROM [dbo].[Products] AS [t0] WHERE blah=blah and someint=500
You start running into problems if the table is updated often, as the COUNT(*) returned by the first statement may not match the second statement... this can produce the error message 'Row not found or changed.'
Some databases (Oracle, Postgresql, SQL Server I think) keep a record of row counts in the system tables; though these are sometimes only accurate to the point at which the statistics were last refreshed (Oracle). You could use this approach, if you only need a fairly-accurate-but-not-exact metric.
Which database are you using, or does that vary?
(PS I know that you are talking about MsSQL however)
I am no DBA but count(*) in MySQL is a real performance hit. Simply changing this to count(ID) really does improve the speed.
I came across this when I was querying a table with very large GLOB (image) data. The query took around 15 seconds to load. Changing the query to count(id) reduced it to 0.02 seconds. Still a little slow, but a hell of a lot better.
I think this is what the DBA is getting at. I have noticed that when debugging Linq, the statement that counts takes a very long time (1 second) to jump to the next statement.
Based on my findings I have to agree with the DBA's concerns...
