Efficient select and distinct on a large association

I have three models: Catalog, Product and Value.
The Value table has a characteristic_id column, and I'd like to get the list of distinct characteristic_id values for a set of values.
The relationships are:
a catalog has many products
a product has many values
Here is the query I came up with:
Value.joins(:product).select(:characteristic_id).distinct.where(products: {catalog_id: catalog.id}).pluck(:characteristic_id)
=> [441, 2582, 3133]
which returns the right result, but it is extremely slow on a large catalog with a million products (about 50 seconds).
I can't find a more efficient way to do this.
Here is an EXPLAIN of the query:
=> EXPLAIN for: SELECT DISTINCT "values"."characteristic_id" FROM "values" INNER JOIN "products" ON "products"."id" = "values"."product_id" WHERE "products"."catalog_id" = $1 [["catalog_id", 1767]]
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1515106.82..1515109.15 rows=233 width=4)
Group Key: "values".characteristic_id
-> Hash Join (cost=124703.76..1492245.65 rows=9144469 width=4)
Hash Cond: ("values".product_id = products.id)
-> Seq Scan on "values" (cost=0.00..1002863.07 rows=34695107 width=8)
-> Hash (cost=114002.20..114002.20 rows=652285 width=4)
-> Bitmap Heap Scan on products (cost=12311.64..114002.20 rows=652285 width=4)
Recheck Cond: (catalog_id = 1767)
-> Bitmap Index Scan on index_products_on_catalog_id (cost=0.00..12148.57 rows=652285 width=0)
Index Cond: (catalog_id = 1767)
(10 rows)
Any idea on how to run this query faster?

Make sure you have indexes on both foreign keys:
index "values"."product_id"
index "products"."catalog_id"

Try to add an index on values.characteristic_id.
Oftentimes GROUP BY may be faster than DISTINCT:
Value.joins(:product).where(products: {catalog_id: catalog.id}).select(:characteristic_id).group(:characteristic_id).pluck(:characteristic_id)

Related

Multi-column indices ordering by date and created_at exhibit strange behavior for different queries

On postgres 10, I have a query like so, for a table with millions of rows, to grab the latest posts belonging to classrooms:
SELECT "posts".*
FROM "posts"
WHERE "posts"."school_id" = 1
AND "posts"."classroom_id" IN (10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
ORDER BY date desc, created_at desc
LIMIT 30 OFFSET 30;
Assume that classrooms only belong to one school.
I have an index like so:
t.index ["date", "created_at", "school_id", "classroom_id"], name: "optimize_post_pagination"
When I run the query, it does a backward index scan like I'd hope and returns in 0.7ms.
Limit (cost=127336.95..254673.34 rows=30 width=494) (actual time=0.189..0.242 rows=30 loops=1)
-> Index Scan Backward using optimize_post_pagination on posts (cost=0.56..1018691.68 rows=240 width=494) (actual time=0.103..0.236 rows=60 loops=1)
Index Cond: (school_id = 1)
Filter: (classroom_id = ANY ('{10,11,...}'::integer[]))
Planning time: 0.112 ms
Execution time: 0.260 ms
However, when I change the query to only include a couple classrooms:
SELECT "posts".*
FROM "posts"
WHERE "posts"."school_id" = 1
AND "posts"."classroom_id" IN (10, 11)
ORDER BY date desc, created_at desc
LIMIT 30 OFFSET 30;
It freaks out and does a lot of extra work, taking nearly 4 sec:
-> Sort (cost=933989.58..933989.68 rows=40 width=494) (actual time=3857.216..3857.219 rows=60 loops=1)
Sort Key: date DESC, created_at DESC
Sort Method: top-N heapsort Memory: 61kB
-> Bitmap Heap Scan on posts (cost=615054.27..933988.51 rows=40 width=494) (actual time=2700.871..3851.518 rows=18826 loops=1)
Recheck Cond: (school_id = 1)
Filter: (classroom_id = ANY ('{10,11}'::integer[]))
Rows Removed by Filter: 86099
Heap Blocks: exact=29256
-> Bitmap Index Scan on optimize_post_pagination (cost=0.00..615054.26 rows=105020 width=0) (actual time=2696.385..2696.385 rows=104925 loops=1)
Index Cond: (school_id = 485)
What's even stranger is that if I drop the WHERE clause for school_id, both cases for classrooms (with a few or with many) run fast with the backward index scan.
This index cookbook suggests putting the ORDER BY index columns last, like so:
t.index ["school_id", "classroom_id", "date", "created_at"], name: "activity_page_index"
But that makes my queries slower, even though the cost is much lower.
Limit (cost=993.93..994.00 rows=30 width=494) (actual time=208.443..208.452 rows=30 loops=1)
-> Sort (cost=993.85..994.45 rows=240 width=494) (actual time=208.436..208.443 rows=60 loops=1)
Sort Key: date DESC, created_at DESC
Sort Method: top-N heapsort Memory: 118kB
-> Index Scan using activity_page_index on posts (cost=0.56..985.56 rows=240 width=494) (actual time=0.032..178.147 rows=102403 loops=1)
Index Cond: ((school_id = 1) AND (classroom_id = ANY ('{10,11,...}'::integer[])))
Planning time: 0.132 ms
Execution time: 208.482 ms
Interestingly, with the activity_page_index query, it does not change its behavior when querying with fewer classrooms.
So, a few questions:
With the original query, why would the number of classrooms make such a massive difference?
Why does dropping the school_id WHERE clause make both cases run fast, even though the index still includes school_id?
How can a high cost query finish quickly (65883 -> 0.7ms) and a lower cost query finish slower (994 -> 208ms)?
Other notes
It is necessary to order by both date and created_at, even though they may seem redundant.

Your first plan as shown seems impossible for your query as shown. The school_id = 1 criterion should show up either as an index condition, or as a filter condition, but you don't show it in either one.
With the original query, why would the number of classrooms make such a massive difference?
With the original plan, it is getting the rows in the desired order by walking the index. Then it gets to stop early once it accumulates 60 rows which meet the non-index criteria. So the more selective those other criteria are, the more of the index it needs to walk before it gets enough rows to stop early. Removing classrooms from the list makes the filter more selective, so it makes that plan look less attractive. At some point, it crosses a line where it looks less attractive than something else.
Why does dropping the school_id WHERE clause make both cases run fast?
You said that every classroom belongs to only one school. But PostgreSQL does not know that; it treats the two criteria as independent, so it gets the overall estimated selectivity by multiplying the two separate estimates. This gives it a very misleading estimate of the overall selectivity, which makes the already-ordered index scan look worse than it really is. Not specifying the redundant school_id prevents it from making this bad assumption about the independence of the criteria. You could create multi-column statistics to try to overcome this, but in my hands this doesn't actually help you on this query until v13 (for reasons I don't understand).
This is about the estimation process, not the execution. So school_id being in the index or not doesn't matter.
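To make the independence problem concrete, here is a rough Ruby sketch of the arithmetic. The two row counts come from the slow plan in the question; TOTAL_ROWS is a made-up figure (the question only says "millions of rows"), and the real planner's cost model is considerably more involved than this, so treat the output as an illustration of the direction of the error rather than a reproduction of the planner's numbers.

```ruby
# Row counts taken from the slow plan in the question.
SCHOOL_ROWS    = 104_925 # rows matching the school_id condition
CLASSROOM_ROWS = 18_826  # rows that also match classroom_id IN (10, 11)
# Assumption for illustration only: the real table size is not given.
TOTAL_ROWS     = 10_000_000
ROWS_NEEDED    = 60      # LIMIT 30 OFFSET 30

# Estimated number of matching rows when the two predicates are
# (wrongly) treated as statistically independent.
def independent_estimate(total, rows_a, rows_b)
  total * (rows_a.to_f / total) * (rows_b.to_f / total)
end

# Expected number of index entries to walk, in ORDER BY order, before
# `needed` matching rows turn up, assuming matches are spread evenly.
def entries_to_walk(total, matching_rows, needed)
  (needed * total / matching_rows.to_f).ceil
end

estimated = independent_estimate(TOTAL_ROWS, SCHOOL_ROWS, CLASSROOM_ROWS)
# Every classroom belongs to one school, so the school_id condition
# removes nothing beyond what the classroom condition already removed.
actual = CLASSROOM_ROWS

puts "matching rows, estimated: #{estimated.round}, actual: #{actual}"
puts "entries to walk, estimated: #{entries_to_walk(TOTAL_ROWS, estimated, ROWS_NEEDED)}"
puts "entries to walk, actual:    #{entries_to_walk(TOTAL_ROWS, actual, ROWS_NEEDED)}"
```

Under these assumptions the independent estimate is far smaller than the real match count, so the ordered index scan appears to need a much longer walk than it really does, which is why an alternative plan can look cheaper on paper.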
How can a high cost query finish quickly (65883 -> 0.7ms) and a lower cost query finish slower (994 -> 208ms)?
"It is difficult to make predictions, especially about the future." Cost estimates are predictions. Sometimes they don't work out very well.

Is PGSQL not executing my index because of the ORDER BY clause?

I have a rails query that looks like this:
Person.limit(10).unclaimed_people({})
def unclaimed_people(opts)
sub_query = where.not(name_id: nil)
if opts[:q].present?
query_with_email_or_name(sub_query, opts)
else
sub_query.group([:name_id, :effective_name])
.reorder('MIN(COALESCE(kudo_position,999999999)), lower(effective_name)')
.select(:name_id)
end
end
Translating to SQL, the query looks like this:
SELECT "people"."name_id" FROM "people"
WHERE ("people"."name_id" IS NOT NULL)
GROUP BY "people"."name_id", "people"."effective_name"
ORDER BY MIN(COALESCE(kudo_position,999999999)), lower(effective_name) LIMIT 10
Now when I run an EXPLAIN on the SQL query, what's returned shows that I am not running an index scan. Here is the EXPLAIN:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=728151.18..728151.21 rows=10 width=53) (actual time=6333.027..6333.028 rows=10 loops=1)
-> Sort (cost=728151.18..729171.83 rows=408258 width=53) (actual time=6333.024..6333.024 rows=10 loops=1)
Sort Key: (min(COALESCE(kudo_position, 999999999))), (lower(effective_name))
Sort Method: top-N heapsort Memory: 25kB
-> GroupAggregate (cost=676646.88..719328.87 rows=408258 width=53) (actual time=4077.902..6169.151 rows=946982 loops=1)
Group Key: name_id, effective_name
-> Sort (cost=676646.88..686041.57 rows=3757877 width=21) (actual time=4077.846..5106.606 rows=3765261 loops=1)
Sort Key: name_id, effective_name
Sort Method: external merge Disk: 107808kB
-> Seq Scan on people (cost=0.00..112125.78 rows=3757877 width=21) (actual time=0.035..939.682 rows=3765261 loops=1)
Filter: (name_id IS NOT NULL)
Rows Removed by Filter: 317644
Planning time: 0.130 ms
Execution time: 6346.994 ms
Pay attention to the bottom part of the query plan: there is a Seq Scan on people. This is not what I was expecting, because in both my development and production databases I have placed an index on the name_id foreign key. Here is the proof from the people table.
"index_people_name_id" btree (name_id) WHERE name_id IS NOT NULL
So my question is: why is the index not being used? Could it be because of the ORDER BY clause? I read that ORDER BY can affect whether an index is used. This is the web page where I read it: Why isn't my index being used?
In particular here is the quote from the page.
Indexes are normally not used for ORDER BY or to perform joins. A sequential scan followed by an explicit sort is usually faster than an index scan of a large table. However, LIMIT combined with ORDER BY often will use an index because only a small portion of the table is returned.
As you can see from the query, I am using ORDER BY combined with LIMIT so I would expect the index to be used. Could this be outdated? Is the ORDER BY really affecting the query? What am I missing to get the index to work? I'm not particularly versed with the internals of PGSQL so any help would be appreciated.

Why is performance of joining tables so much faster than joining subqueries

I want to join two big tables (711147 and 469519 rows), but I need just subsets of these tables (44593 rows and 28191 rows). When I create temp tables containing the subsets, the join is very quick (below 1 second). When I use subqueries or views, it takes 5 to 10 minutes.
The problem is that every time I use this query, the subset (jahr = 2016) has changed, so with the "fast way" I would have to recreate the temp tables first each time. On top of that, this query is itself the basis of a view, and I don't know when the view will be used.
The fast way with temp tables looks like this:
select rechnung, art into temp rng16 from rng where jahr = 2016;
select rechnung, artikel, menge, epreis into temp fla16 from fla where jahr = 2016;
explain analyse select * from rng16 natural join fla16;
and the result is:
Merge Join (cost=4783.18..27406.15 rows=1500012 width=104) (actual time=544.691..679.280 rows=44593 loops=1)
Merge Cond: (rng16.rechnung = fla16.rechnung)
-> Sort (cost=1681.83..1714.72 rows=13158 width=64) (actual time=222.233..233.251 rows=27630 loops=1)
Sort Key: rng16.rechnung
Sort Method: external merge Disk: 520kB
-> Seq Scan on rng16 (cost=0.00..284.58 rows=13158 width=64) (actual time=0.009..2.880 rows=28191 loops=1)
-> Materialize (cost=3101.35..3215.35 rows=22800 width=72) (actual time=322.449..362.445 rows=44593 loops=1)
-> Sort (cost=3101.35..3158.35 rows=22800 width=72) (actual time=322.444..356.178 rows=44593 loops=1)
Sort Key: fla16.rechnung
Sort Method: external merge Disk: 1248kB
-> Seq Scan on fla16 (cost=0.00..513.00 rows=22800 width=72) (actual time=0.008..7.832 rows=44593 loops=1)
Total runtime: 682.589 ms
but doing it "on the fly" with two subqueries
explain analyse select * from (select rechnung, art from rng where jahr=2016) rng16 natural join (select rechnung, artikel, menge, epreis from fla where jahr = 2016) fla16;
takes ages. Output of explain is:
Nested Loop (cost=0.85..10.98 rows=1 width=21) (actual time=0.036..453240.711 rows=44593 loops=1)
Join Filter: (rng.rechnung = fla.rechnung)
Rows Removed by Join Filter: 1257076670
-> Index Scan using rng_jahr on rng (cost=0.42..5.51 rows=1 width=9) (actual time=0.017..54.372 rows=28191 loops=1)
Index Cond: (jahr = 2016)
-> Index Scan using fla_jahr on fla (cost=0.42..5.46 rows=1 width=19) (actual time=0.020..9.875 rows=44593 loops=28191)
Index Cond: (jahr = 2016)
Total runtime: 453253.579 ms

Instead of using subqueries, try joining tables, like here:
explain analyse
select
rng16.rechnung, rng16.art,
fla.rechnung, fla.artikel, fla.menge, fla.epreis
from
rng rng16
natural join
fla
where
rng16.jahr = 2016
and fla.jahr = 2016
Is it still a nested loop?
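As a sanity check on why the nested loop matters, the row counts in the slow plan are consistent with every rng row being compared against every fla row. The Ruby sketch below checks that arithmetic and then contrasts a toy nested-loop join with a toy hash join on hypothetical sample rows. The hash join is used only as a simple illustration of a strategy that avoids comparing all pairs; the fast plan in the question actually used a merge join.

```ruby
# Row counts taken from the slow plan in the question.
RNG_ROWS = 28_191 # rng rows with jahr = 2016 (outer side)
FLA_ROWS = 44_593 # fla rows with jahr = 2016 (inner side, rescanned per outer row)
REMOVED_BY_JOIN_FILTER = 1_257_076_670
JOINED_ROWS = 44_593

# Every (rng, fla) pair is either joined or rejected by the join filter.
puts RNG_ROWS * FLA_ROWS == REMOVED_BY_JOIN_FILTER + JOINED_ROWS

# Toy nested-loop join: compares every outer row with every inner row.
def nested_loop_join(outer, inner)
  comparisons = 0
  rows = []
  outer.each do |o|
    inner.each do |i|
      comparisons += 1
      rows << [o, i] if o[:rechnung] == i[:rechnung]
    end
  end
  [rows, comparisons]
end

# Toy hash join: groups the inner rows by key once, then does one
# lookup per outer row.
def hash_join(outer, inner)
  by_key = inner.group_by { |i| i[:rechnung] }
  probes = 0
  rows = outer.flat_map do |o|
    probes += 1
    by_key.fetch(o[:rechnung], []).map { |i| [o, i] }
  end
  [rows, probes]
end

# Hypothetical sample data, not taken from the question.
rng = [{ rechnung: 1, art: "a" }, { rechnung: 2, art: "b" }]
fla = [{ rechnung: 1, artikel: "x" }, { rechnung: 1, artikel: "y" },
       { rechnung: 2, artikel: "z" }]

_, comparisons = nested_loop_join(rng, fla)
_, probes = hash_join(rng, fla)
puts "nested loop comparisons: #{comparisons}, hash join probes: #{probes}"
```

Both functions return the same joined rows; only the amount of work differs, and that gap grows with the product of the two input sizes.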

Create postgres index for table with inner join in RubyOnRails

I have an app based on RubyOnRails 4.0. I have two models: Stores and Products. There are about 1.5 million products in the system making it quite slow if I do not use indices properly.
Some basic info
Store has_many Products
Store.affiliate_type_id is used where 1=Affiliated 2=Not affiliated
Products have attributes like "category_connection_id" (integer) and "is_available" (boolean)
In FeededProduct model:
scope :affiliated, -> { joins(:store).where("stores.affiliate_type_id = 1") }
This query takes about 500ms which basically interrupts the website:
FeededProduct.where(:is_available => true).affiliated.where(:category_connection_id => @feeded_product.category_connection_id)
Corresponding postgresql:
FeededProduct Load (481.4ms) SELECT "feeded_products".* FROM "feeded_products" INNER JOIN "stores" ON "stores"."id" = "feeded_products"."store_id" WHERE "feeded_products"."is_available" = 't' AND "feeded_products"."category_connection_id" = 345 AND (stores.affiliate_type_id = 1)
Update. Postgresql EXPLAIN:
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Join (cost=477.63..49176.17 rows=21240 width=1084)
Hash Cond: (feeded_products.store_id = stores.id)
-> Bitmap Heap Scan on feeded_products (cost=377.17..48983.06 rows=38580 width=1084)
Recheck Cond: (category_connection_id = 5923)
Filter: is_available
-> Bitmap Index Scan on cc_w_store_index_on_fp (cost=0.00..375.25 rows=38580 width=0)
Index Cond: ((category_connection_id = 5923) AND (is_available = true))
-> Hash (cost=98.87..98.87 rows=452 width=4)
-> Seq Scan on stores (cost=0.00..98.87 rows=452 width=4)
Filter: (affiliate_type_id = 1)
(10 rows)
Question: How can I create an index that will take the inner join into consideration and make this faster?

That depends on the join algorithm that PostgreSQL chooses. Use EXPLAIN on the query to see how PostgreSQL processes the query.
These are the answers depending on the join algorithm:
nested loop join
Here you should create an index on the join condition for the inner relation (the bottom table in the EXPLAIN output). You may further improve things by adding columns that appear in the WHERE clause and significantly improve selectivity (i.e., significantly reduce the number of rows filtered out during the index scan).
For the outer relation, an index on the columns that appear in the WHERE clause will speed up the query if these conditions filter out most of the rows in the table.
hash join
Here it helps to have indexes on both tables on those columns in the WHERE clause where the conditions filter out most of the rows in the table.
merge join
Here you need indexes on the columns in the merge condition to allow PostgreSQL to use an index scan for sorting. Additionally, you can append columns that appear in the WHERE clause.
Always test with EXPLAIN if your indexes get used. If not, odds are that either they cannot be used or that using them would make the query slower than a sequential scan, e.g. because they do not filter out enough rows.

Uniqueness case-sensitive false causes slow query

I have the following validation:
validates :username, uniqueness: { case_sensitive: false }
This causes the following query to run painfully slowly:
5,510 ms
SELECT ? AS one FROM "users" WHERE (LOWER("users"."username") = LOWER(?) AND "users"."id" != ?) LIMIT ?
Explain plan
1 Query plan Limit (cost=0.03..4.03 rows=1 width=0)
2 Query plan -> Index Scan using idx_users_lower_username on users (cost=0.03..4.03 rows=1 width=0)
3 Query plan Index Cond: ?
4 Query plan Filter: ?
The index was created in my structure.sql using CREATE INDEX idx_users_lower_username ON users USING btree (lower((username)::text)); See my question How to create index on LOWER("users"."username") in Rails (using postgres) for more on this.
This is using the index I set and still takes over 5 seconds? What's wrong here?

There are several different, interrelated things going on here. Exactly how you carry out the changes depends on how you manage changes to your database structure. The most common way is to use Rails migrations, but your linked question suggests you're not doing that. So I'll speak mostly in SQL, and you can adapt that to your method.
Use a sargable WHERE clause
Your WHERE clause isn't sargable. That means it's written in a way that prevents the dbms from using an index. To create an index PostgreSQL can use here . . .
create index on "users" (lower("username") varchar_pattern_ops);
Now queries on lowercased usernames can use that index.
explain analyze
select *
from users
where lower(username) = lower('9LCDgRHk7kIXehk6LESDqHBJCt9wmA');
It might appear as if PostgreSQL must lowercase every username in the table, but its query planner is smart enough to see that the expression lower(username) is itself indexed. PostgreSQL uses an index scan.
"Index Scan using users_lower_idx on users (cost=0.43..8.45 rows=1 width=35) (actual time=0.034..0.035 rows=1 loops=1)"
" Index Cond: (lower((username)::text) = 'b0sa9malg7yt1shssajrynqhiddm5d'::text)"
"Total runtime: 0.058 ms"
This table has a million rows of random-ish data; the query returns very, very quickly. It's just about equally fast with the additional condition on "id", but the LIMIT clause slows it down a lot. "Slows it down a lot" doesn't mean it's slow; it still returns in less than 0.1 ms.
Also, here the varchar_pattern_ops lets queries that use the LIKE operator use the index.
explain analyze
select *
from users
where lower(username) like 'b%'
"Bitmap Heap Scan on users (cost=1075.12..9875.78 rows=30303 width=35) (actual time=10.217..91.030 rows=31785 loops=1)"
" Filter: (lower((username)::text) ~~ 'b%'::text)"
" -> Bitmap Index Scan on users_lower_idx (cost=0.00..1067.54 rows=31111 width=0) (actual time=8.648..8.648 rows=31785 loops=1)"
" Index Cond: ((lower((username)::text) ~>=~ 'b'::text) AND (lower((username)::text) ~<~ 'c'::text))"
"Total runtime: 93.541 ms"
Only 94 ms to select and return 30k rows from a million.
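The Index Cond in that plan shows how the prefix match is turned into a range: everything from 'b' up to, but not including, 'c'. A minimal Ruby sketch of the same idea on a sorted array follows. The helper names are hypothetical, the comparison is plain bytewise string ordering, and it assumes a non-empty prefix whose last character can simply be incremented, which is a simplification of what PostgreSQL does.

```ruby
# Turn a prefix into a half-open range [lower, upper), e.g. "b" -> ["b", "c"].
# Assumes the last character of the prefix can simply be incremented.
def prefix_bounds(prefix)
  raise ArgumentError, "prefix must not be empty" if prefix.empty?

  upper = prefix[0...-1] + (prefix[-1].ord + 1).chr(Encoding::UTF_8)
  [prefix, upper]
end

# Find all names starting with `prefix` in an array that is already
# sorted bytewise, using two binary searches instead of a full scan.
def prefix_scan(sorted_names, prefix)
  lower, upper = prefix_bounds(prefix)
  first = sorted_names.bsearch_index { |name| name >= lower } || sorted_names.length
  last  = sorted_names.bsearch_index { |name| name >= upper } || sorted_names.length
  sorted_names[first...last]
end

# Hypothetical sample data standing in for lower(username) values.
names = %w[alice bob bonnie carol dave].sort
p prefix_bounds("b")
p prefix_scan(names, "b")
```

The sorted array plays the role of the btree index here: because the entries are ordered, the matching block is contiguous and can be located without touching the rest.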
Queries on very small tables might use a sequential scan even though there's a usable index. I wouldn't worry about that if I were you.
Enforce uniqueness in the database
If you're expecting any bursts of traffic, you should enforce uniqueness in the database. I do this all the time, regardless of any expectations (guesses) about traffic.
The RailsGuides Active Record Validations includes this slightly misleading or confusing paragraph about the "uniqueness" helper.
This helper validates that the attribute's value is unique right
before the object gets saved. It does not create a uniqueness
constraint in the database, so it may happen that two different
database connections create two records with the same value for a
column that you intend to be unique. To avoid that, you must create a
unique index on both columns in your database. See the MySQL manual
for more details about multiple column indexes.
It clearly says that, in fact, it doesn't guarantee uniqueness. The misleading part is about creating a unique index on "both columns". If you want "username" to be unique, you need to declare a unique constraint on the column "username".
alter table "users"
add constraint constraint_name unique (username);
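To illustrate why the validation alone is not enough, here is a small deterministic Ruby simulation of two requests that both pass the "is it taken?" check before either one inserts. FakeUsers is a hypothetical in-memory stand-in, not Rails or PostgreSQL code. It compares names case-insensitively to mirror case_sensitive: false; note that a plain unique constraint on username, as in the statement above, is case-sensitive, so matching this behavior in the database would need a unique index on lower(username) instead.

```ruby
class DuplicateKey < StandardError; end

# Hypothetical in-memory stand-in for the users table.
class FakeUsers
  attr_reader :rows

  def initialize(unique_index: false)
    @rows = []
    @unique_index = unique_index
  end

  # What the validation does: look for an existing row, ignoring case.
  def taken?(username)
    @rows.any? { |name| name.downcase == username.downcase }
  end

  # What the database does on INSERT: a unique index, if present,
  # is checked at this point regardless of any earlier validation.
  def insert(username)
    raise DuplicateKey, username if @unique_index && taken?(username)

    @rows << username
  end
end

# Two requests interleave: both validate first, then both insert.
def simulate(table)
  ok_a = !table.taken?("Alice")
  ok_b = !table.taken?("alice")
  table.insert("Alice") if ok_a
  table.insert("alice") if ok_b
  table.rows
end

p simulate(FakeUsers.new) # both rows are stored

begin
  simulate(FakeUsers.new(unique_index: true))
rescue DuplicateKey => e
  puts "second insert rejected: #{e.message}"
end
```

With the index in place, the second insert fails and the application can rescue the error and report the username as taken.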
Case-sensitivity
In SQL databases, case-sensitivity is determined by collation. Collation is part of the SQL standards.
In PostgreSQL, you can set collation at the database level, at the column level, at the index level, and at the query level. Values come from the locales the operating system exposes at the time you create a new database cluster using initdb.
On Linux systems, you probably have no case-insensitive collations. That's one reason we have to jump through rather more hoops than people who target SQL Server and Oracle.

Try running the query in psql with explain analyze to make sure PostgreSQL itself is behaving, because the index and the query both look right.
If it is fast in psql, then the problem is in your Rails code.
This query against a table with 3k records gave this result (on my local dev machine):
app=# explain analyze SELECT id AS one FROM "users" WHERE (LOWER(email) = LOWER('marcus@marcus.marcus') AND "users"."id" != 2000);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on users (cost=4.43..58.06 rows=19 width=4) (actual time=0.101..0.101 rows=0 loops=1)
Recheck Cond: (lower((email)::text) = 'marcus@marcus.marcus'::text)
Filter: (id <> 2000)
-> Bitmap Index Scan on users_lower_idx (cost=0.00..4.43 rows=19 width=0) (actual time=0.097..0.097 rows=0 loops=1)
Index Cond: (lower((email)::text) = 'marcus@marcus.marcus'::text)
Total runtime: 0.144 ms
(6 rows)
