I have a Rails query that looks like this:
Person.limit(10).unclaimed_people({})
# Must be a class method so it can be called on a relation like Person.limit(10).
def self.unclaimed_people(opts)
  sub_query = where.not(name_id: nil)
  if opts[:q].present?
    query_with_email_or_name(sub_query, opts)
  else
    sub_query.group([:name_id, :effective_name])
             .reorder('MIN(COALESCE(kudo_position,999999999)), lower(effective_name)')
             .select(:name_id)
  end
end
Translated to SQL, the query looks like this:
SELECT "people"."name_id" FROM "people"
WHERE ("people"."name_id" IS NOT NULL)
GROUP BY "people"."name_id", "people"."effective_name"
ORDER BY MIN(COALESCE(kudo_position,999999999)), lower(effective_name) LIMIT 10
Now when I run EXPLAIN on the SQL query, the output shows that it is not using an index scan. Here is the EXPLAIN:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=728151.18..728151.21 rows=10 width=53) (actual time=6333.027..6333.028 rows=10 loops=1)
-> Sort (cost=728151.18..729171.83 rows=408258 width=53) (actual time=6333.024..6333.024 rows=10 loops=1)
Sort Key: (min(COALESCE(kudo_position, 999999999))), (lower(effective_name))
Sort Method: top-N heapsort Memory: 25kB
-> GroupAggregate (cost=676646.88..719328.87 rows=408258 width=53) (actual time=4077.902..6169.151 rows=946982 loops=1)
Group Key: name_id, effective_name
-> Sort (cost=676646.88..686041.57 rows=3757877 width=21) (actual time=4077.846..5106.606 rows=3765261 loops=1)
Sort Key: name_id, effective_name
Sort Method: external merge Disk: 107808kB
-> Seq Scan on people (cost=0.00..112125.78 rows=3757877 width=21) (actual time=0.035..939.682 rows=3765261 loops=1)
Filter: (name_id IS NOT NULL)
Rows Removed by Filter: 317644
Planning time: 0.130 ms
Execution time: 6346.994 ms
Pay attention to the bottom part of the query plan: there is a Seq Scan on people. This is not what I was expecting; in both my development and production databases I have placed an index on the name_id foreign key. Here is the proof from the people table:
"index_people_name_id" btree (name_id) WHERE name_id IS NOT NULL
So my question is: why isn't the index being used? Could it be the ORDER BY clause? I read that ORDER BY can affect whether an index is used; this is the web page where I read that.
In particular, here is the quote from the page:
Indexes are normally not used for ORDER BY or to perform joins. A sequential scan followed by an explicit sort is usually faster than an index scan of a large table. However, LIMIT combined with ORDER BY often will use an index because only a small portion of the table is returned.
As you can see from the query, I am using ORDER BY combined with LIMIT, so I would expect the index to be used. Could this advice be outdated? Is the ORDER BY really affecting the query? What am I missing to get the index to work? I'm not particularly versed in the internals of PostgreSQL, so any help would be appreciated.
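As a diagnostic sketch (not a fix; enable_seqscan is a standard PostgreSQL planner setting), you can disable sequential scans for the session and re-run the EXPLAIN to see what the index-based plan looks like and whether it is actually any faster:
-- Session-local experiment: forbid seq scans so the planner must use the index.
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT "people"."name_id" FROM "people"
WHERE ("people"."name_id" IS NOT NULL)
GROUP BY "people"."name_id", "people"."effective_name"
ORDER BY MIN(COALESCE(kudo_position,999999999)), lower(effective_name)
LIMIT 10;
RESET enable_seqscan;
If the index plan turns out no faster, the planner was skipping the index deliberately: per the plan above, the filter discards only ~317k of ~4M rows, and every surviving row still has to be aggregated before the ORDER BY on MIN(...) can be evaluated, so the index can save very little here.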
Related
On Postgres 10, I have a query like so, for a table with millions of rows, to grab the latest posts belonging to classrooms:
SELECT "posts".*
FROM "posts"
WHERE "posts"."school_id" = 1
AND "posts"."classroom_id" IN (10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
ORDER BY date desc, created_at desc
LIMIT 30 OFFSET 30;
Assume that classrooms only belong to one school.
I have an index like so:
t.index ["date", "created_at", "school_id", "classroom_id"], name: "optimize_post_pagination"
When I run the query, it does a backward index scan like I'd hoped and returns in 0.7 ms.
Limit (cost=127336.95..254673.34 rows=30 width=494) (actual time=0.189..0.242 rows=30 loops=1)
-> Index Scan Backward using optimize_post_pagination on posts (cost=0.56..1018691.68 rows=240 width=494) (actual time=0.103..0.236 rows=60 loops=1)
Index Cond: (school_id = 1)
" Filter: (classroom_id = ANY ('{10,11,...}'::integer[]))"
Planning time: 0.112 ms
Execution time: 0.260 ms
However, when I change the query to only include a couple classrooms:
SELECT "posts".*
FROM "posts"
WHERE "posts"."school_id" = 1
AND "posts"."classroom_id" IN (10, 11)
ORDER BY date desc, created_at desc
LIMIT 30 OFFSET 30;
It freaks out and does a lot of extra work, taking nearly 4 sec:
-> Sort (cost=933989.58..933989.68 rows=40 width=494) (actual time=3857.216..3857.219 rows=60 loops=1)
" Sort Key: date DESC, created_at DESC"
Sort Method: top-N heapsort Memory: 61kB
-> Bitmap Heap Scan on posts (cost=615054.27..933988.51 rows=40 width=494) (actual time=2700.871..3851.518 rows=18826 loops=1)
Recheck Cond: (school_id = 1)
" Filter: (classroom_id = ANY ('{10,11}'::integer[]))"
Rows Removed by Filter: 86099
Heap Blocks: exact=29256
-> Bitmap Index Scan on optimize_post_pagination (cost=0.00..615054.26 rows=105020 width=0) (actual time=2696.385..2696.385 rows=104925 loops=1)
Index Cond: (school_id = 485)
What's even stranger is that if I drop the WHERE clause for school_id, both classroom cases (with a few or with many) run fast with the backward index scan.
This index cookbook suggests putting the ORDER BY index columns last, like so:
t.index ["school_id", "classroom_id", "date", "created_at"], name: "activity_page_index"
But that makes my queries slower, even though the cost is much lower.
Limit (cost=993.93..994.00 rows=30 width=494) (actual time=208.443..208.452 rows=30 loops=1)
-> Sort (cost=993.85..994.45 rows=240 width=494) (actual time=208.436..208.443 rows=60 loops=1)
" Sort Key: date DESC, created_at DESC"
Sort Method: top-N heapsort Memory: 118kB
-> Index Scan using activity_page_index on posts (cost=0.56..985.56 rows=240 width=494) (actual time=0.032..178.147 rows=102403 loops=1)
" Index Cond: ((school_id = 1) AND (classroom_id = ANY ('{10,11,...}'::integer[])))"
Planning time: 0.132 ms
Execution time: 208.482 ms
Interestingly, with the activity_page_index query, it does not change its behavior when querying with fewer classrooms.
So, a few questions:
With the original query, why would the number of classrooms make such a massive difference?
Why does dropping the school_id WHERE clause make both cases run fast, even though the index still includes school_id?
How can a high-cost query finish quickly (65883 -> 0.7 ms) while a lower-cost query finishes more slowly (994 -> 208 ms)?
Other notes
It is necessary to order by both date and created_at, even though they may seem redundant.
Your first plan as shown seems impossible for your query as shown. The school_id = 1 criterion should show up either as an index condition, or as a filter condition, but you don't show it in either one.
With the original query, why would the number of classrooms make such a massive difference?
With the original plan, it gets the rows in the desired order by walking the index, and it can stop early once it accumulates 60 rows that meet the non-index criteria. So the more selective those other criteria are, the more of the index it needs to walk before it finds enough rows to stop early. Removing classrooms from the list makes the filter more selective, which makes that plan look less attractive. At some point it crosses a line where it looks worse than something else.
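As a rough worked example (the numbers are illustrative, not taken from the plans): the query needs 60 qualifying rows to satisfy LIMIT 30 OFFSET 30, so if the classroom filter passes 1 row in 100, the scan walks on the order of 60 × 100 = 6,000 index entries before it can stop; if it passes only 1 row in 1,000, it must walk about 60 × 1,000 = 60,000 entries.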
Why does dropping the school_id WHERE clause make both cases run fast?
You said that every classroom belongs to only one school. But PostgreSQL does not know that, it thinks the two criteria are independent, so gets the overall estimated selectivity by multiplying the two separate estimates. This gives it a very misleading estimate of the overall selectivity, which makes the already-ordered index scan look worse than it really is. Not specifying the redundant school_id prevents it from making this bad assumption about the independence of the criteria. You could create multi-column statistics to try to overcome this, but in my hands this doesn't actually help you on this query until v13 (for reasons I don't understand).
This is about the estimation process, not the execution. So school_id being in the index or not doesn't matter.
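For what it's worth, a minimal sketch of those multi-column (extended) statistics, assuming the posts table from the question (the statistics name is arbitrary, and as noted above this may not actually change the plan before v13):
-- Tell the planner that school_id and classroom_id are not independent.
CREATE STATISTICS posts_school_classroom_stats (dependencies)
    ON school_id, classroom_id FROM posts;
ANALYZE posts;  -- extended statistics are only populated when ANALYZE runs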
How can a high-cost query finish quickly (65883 -> 0.7 ms) while a lower-cost query finishes more slowly (994 -> 208 ms)?
"It is difficult to make predictions, especially about the future." Cost estimates are predictions. Sometimes they don't work out very well.
I have a table with parts. These parts have a field where all the relevant info is grouped.
I have to perform a search on this field for every word in a search input, using ILIKE with wildcards on both sides. The table has 1.2M rows at the moment.
I have been reading about the best way to index the field for searching, and finally decided to go with GIN trigram indexes. The problem is that the query takes too much time when one of the words is shorter than 3 characters, making the search go from blazingly fast to taking well over 10 seconds in many cases.
Examples and measurements:
This query makes use of the trigram index and finishes quickly.
SELECT "parts".* FROM "parts" WHERE (parts.eureka ILIKE '%rodamiento%') AND (parts.eureka ILIKE '%skf%') AND (parts.eureka ILIKE '%asf%')
The analyze output is as follows:
Bitmap Heap Scan on parts (cost=716.03..741.93 rows=13 width=195) (actual time=21.194..21.346 rows=29 loops=1)
Recheck Cond: ((eureka ~~* '%rodamiento%'::text) AND (eureka ~~* '%fag%'::text) AND (eureka ~~* '%asf%'::text))
Heap Blocks: exact=17
-> Bitmap Index Scan on parts_eureka_idx (cost=0.00..716.03 rows=13 width=0) (actual time=21.164..21.164 rows=29 loops=1)
Index Cond: ((eureka ~~* '%rodamiento%'::text) AND (eureka ~~* '%fag%'::text) AND (eureka ~~* '%asf%'::text))
Planning Time: 0.614 ms
Execution Time: 21.467 ms
Now, on the other hand, this other query uses a sequential scan and takes far longer. Notice that the only change is %as% instead of %asf%:
SELECT "parts".* FROM "parts" WHERE (parts.eureka ILIKE '%rodamiento%') AND (parts.eureka ILIKE '%fag%') AND (parts.eureka ILIKE '%as%')
Gather (cost=1000.00..85599.12 rows=87 width=195) (actual time=0.337..3988.485 rows=6548 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on parts (cost=0.00..84590.42 rows=51 width=195) (actual time=0.116..3940.638 rows=3274 loops=2)
Filter: ((eureka ~~* '%rodamiento%'::text) AND (eureka ~~* '%fag%'::text) AND (eureka ~~* '%as%'::text))
Rows Removed by Filter: 637016
Planning Time: 1.003 ms
Execution Time: 3989.197 ms
The code for the search is as follows:
Part.rb
class Part < ApplicationRecord
  acts_as_copy_target

  scope :tipo, ->(tipo) { where tipo: tipo }

  def self.search(params)
    recordset = Part.all
    recordset = recordset.tipo(params[:tipo]) if params[:tipo].present?
    recordset = search_keywords(params[:search], recordset)
    recordset.order(:price_amount1)
  end

  # Chains one ILIKE condition onto the recordset per keyword.
  # Always returns the recordset (the original returned nil for a blank
  # query), and uses private_class_method because a bare `private` has
  # no effect on `def self.` methods.
  def self.search_keywords(query, recordset)
    keywords = query.to_s.unicode_normalize(:nfc).gsub(/[^[:alnum:]]/, " ").strip.split
    keywords.each do |keyword|
      recordset = recordset.where("parts.eureka ILIKE :q", q: "%#{keyword}%")
    end
    recordset
  end
  private_class_method :search_keywords
end
I was thinking about splitting the query like this:
One query to search the words with length >= 3 to leverage the trigram indexes.
Over the returned recordset, run another query for the remaining words. I assume that a seq scan over a reduced recordset will take less time than the seq scan shown in the previous analyze output.
Is this a good idea? How can I tell Active Record to act like that? Any other suggestions to improve this?
There is a proposal to fix this problem, but it has not yet been reviewed and committed. So I don't know if it will make it into version 13 or not.
You can combine your two steps into one by preventing PostgreSQL from thinking it can use the index for the short search strings:
select * from foo where (x ilike '%long%') and (x||'' ilike '%sh%')
The secret is the ||'', which inhibits index usage on that clause without changing the results.
Now, how to reverse engineer this into Ruby is not a task for me, but based on the snippet you posted it doesn't seem like it should be hard.
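For illustration, a sketch of the combined query against the parts table from the question (table and column names are the asker's; the ||'' is applied only to the sub-3-character word):
SELECT "parts".*
FROM "parts"
WHERE (parts.eureka ILIKE '%rodamiento%')   -- long word: trigram-indexable
  AND (parts.eureka ILIKE '%fag%')          -- 3 characters: still trigram-indexable
  AND (parts.eureka || '' ILIKE '%as%');    -- ||'' hides the column, so this clause skips the index
On the Rails side, that would amount to using the fragment "parts.eureka || '' ILIKE :q" instead of "parts.eureka ILIKE :q" whenever keyword.length < 3.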
I have a Rails app with a Postgres DB. It has 20 million records. Most of the queries use ILIKE. I have created a trigram index on one of the columns.
Before adding the trigram index, the query execution time was ~200 s to ~300 s (seconds, not ms).
After creating the trigram index, the query execution time came down to ~30 s.
How can I reduce the execution time to milliseconds?
Also are there any good practices/suggestions when dealing with a database this huge?
Thanks in advance :)
Ref : Faster PostgreSQL Searches with Trigrams
Edit: EXPLAIN ANALYZE output for one of the queries:
EXPLAIN ANALYZE SELECT COUNT(*) FROM "listings" WHERE (categories ilike '%store%');
QUERY PLAN
--------------------------------------------------------------------------
Aggregate (cost=716850.70..716850.71 rows=1 width=0) (actual time=199354.861..199354.861 rows=1 loops=1)
-> Bitmap Heap Scan on listings (cost=3795.12..715827.76 rows=409177 width=0) (actual time=378.374..199005.008 rows=691941 loops=1)
Recheck Cond: ((categories)::text ~~* '%store%'::text)
Rows Removed by Index Recheck: 7302878
Heap Blocks: exact=33686 lossy=448936
-> Bitmap Index Scan on listings_on_categories_idx (cost=0.00..3692.82 rows=409177 width=0) (actual time=367.931..367.931 rows=692449 loops=1)
Index Cond: ((categories)::text ~~* '%store%'::text)
Planning time: 1.345 ms
Execution time: 199355.260 ms
(9 rows)
The index scan itself is fast (0.3 seconds), but the trigram index finds more than half a million potential matches. All of these rows have to be checked if they actually match the pattern, which is where the time is spent.
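A further hedged observation on the plan above: Heap Blocks: exact=33686 lossy=448936 means the bitmap overflowed work_mem, so for the lossy pages PostgreSQL only remembers "something on this page matched" and has to recheck every row on those pages, which is where Rows Removed by Index Recheck: 7302878 comes from. Raising work_mem for the session (the value below is illustrative) can make the bitmap exact again and cut down the recheck work:
SET work_mem = '256MB';  -- session-local; pick a value the server can afford
EXPLAIN ANALYZE SELECT COUNT(*) FROM "listings" WHERE (categories ILIKE '%store%');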
For longer strings or strings with less common letters the performance should be considerably better. Is it a solution for you to impose a lower bound on the length of the search string?
Other than that, maybe the only solution is to use external text-search software.
I have a table with 150k names and tried adding an index on lower(name) to speed up lookups. The index speeds up raw SQL queries by about 100×, but the same query performed through ActiveRecord is unaffected, if not a bit slower.
These are the queries:
NamedEntity.where("lower(name) = ?", "John Doe".downcase).first
vs
conn.execute(
%q{SELECT "named_entities".* FROM "named_entities" WHERE (lower(name) = 'john doe');}
)
I added the index with
CREATE INDEX index_named_entities_on_lower_name ON named_entities USING btree (lower(name));
Here are the benchmarks comparing all cases (50 executions each, times in seconds):
no index, AR: 6.999421
with index, AR: 7.264234
no index, SQL: 5.569600
with index, SQL: 0.045464
The query plans are exactly the same for AR and SQL.
Without index:
Seq Scan on named_entities (cost=0.00..2854.31 rows=785 width=130)
Filter: (lower((name)::text) = 'john doe'::text)
And with index:
Bitmap Heap Scan on named_entities (cost=9.30..982.51 rows=785 width=130)
Recheck Cond: (lower((name)::text) = 'john doe'::text)
-> Bitmap Index Scan on index_named_entities_on_lower_name (cost=0.00..9.26 rows=785 width=0)
Index Cond: (lower((name)::text) = 'john doe'::text)
I have no idea how to explain this. The overhead added by ActiveRecord should not be influenced by the index, so the difference in speed between index and no index should be the same for AR and SQL, no?
I found out how to fix the problem, by just adding ANALYZE named_entities; after creating the index. That makes Postgres update its statistics on all kinds of things, so it can generate a better query plan. (Also found out the Postgres docs are amazing.)
That still doesn't explain the time difference though, as the explains indicated both the SQL and AR queries led to the same slow query plan.
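For reference, the full sequence is just the index DDL from above followed by ANALYZE. One reason ANALYZE matters especially for an expression index is that PostgreSQL gathers statistics on the indexed expression itself, here lower(name), only when ANALYZE runs; before that, the planner is guessing at the selectivity of the lower(name) = ... condition.
CREATE INDEX index_named_entities_on_lower_name
    ON named_entities USING btree (lower(name));
ANALYZE named_entities;  -- collects stats, including on the lower(name) expression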
I have the following validation:
validates :username, uniqueness: { case_sensitive: false }
It causes the following query to run painfully slowly:
5,510 ms
SELECT ? AS one FROM "users" WHERE (LOWER("users"."username") = LOWER(?) AND "users"."id" != ?) LIMIT ?
Explain plan
Limit (cost=0.03..4.03 rows=1 width=0)
  -> Index Scan using idx_users_lower_username on users (cost=0.03..4.03 rows=1 width=0)
       Index Cond: ?
       Filter: ?
The index was created in my structure.sql using CREATE INDEX idx_users_lower_username ON users USING btree (lower((username)::text)); See my question How to create index on LOWER("users"."username") in Rails (using postgres) for more on this.
This is using the index I set and still takes over 5 seconds? What's wrong here?
There are several different, interrelated things going on here. Exactly how you carry out the changes depends on how you manage changes to your database structure. The most common way is to use Rails migrations, but your linked question suggests you're not doing that. So I'll speak mostly in SQL, and you can adapt that to your method.
Use a sargable WHERE clause
Your WHERE clause isn't sargable. That means it's written in a way that prevents the DBMS from using an index. To create an index PostgreSQL can use here...
create index on "users" (lower("username") varchar_pattern_ops);
Now queries on lowercased usernames can use that index.
explain analyze
select *
from users
where lower(username) = lower('9LCDgRHk7kIXehk6LESDqHBJCt9wmA');
It might appear as if PostgreSQL must lowercase every username in the table, but its query planner is smart enough to see that the expression lower(username) is itself indexed. PostgreSQL uses an index scan.
"Index Scan using users_lower_idx on users (cost=0.43..8.45 rows=1 width=35) (actual time=0.034..0.035 rows=1 loops=1)"
" Index Cond: (lower((username)::text) = 'b0sa9malg7yt1shssajrynqhiddm5d'::text)"
"Total runtime: 0.058 ms"
This table has a million rows of random-ish data; the query returns very, very quickly. It's just about equally fast with the additional condition on "id", but the LIMIT clause slows it down a lot. "Slows it down a lot" doesn't mean it's slow; it still returns in less than 0.1 ms.
Also, varchar_pattern_ops here lets queries that use the LIKE operator use the index.
explain analyze
select *
from users
where lower(username) like 'b%'
"Bitmap Heap Scan on users (cost=1075.12..9875.78 rows=30303 width=35) (actual time=10.217..91.030 rows=31785 loops=1)"
" Filter: (lower((username)::text) ~~ 'b%'::text)"
" -> Bitmap Index Scan on users_lower_idx (cost=0.00..1067.54 rows=31111 width=0) (actual time=8.648..8.648 rows=31785 loops=1)"
" Index Cond: ((lower((username)::text) ~>=~ 'b'::text) AND (lower((username)::text) ~<~ 'c'::text))"
"Total runtime: 93.541 ms"
Only 94 ms to select and return 30k rows from a million.
Queries on very small tables might use a sequential scan even though there's a usable index. I wouldn't worry about that if I were you.
Enforce uniqueness in the database
If you're expecting any bursts of traffic, you should enforce uniqueness in the database. I do this all the time, regardless of any expectations (guesses) about traffic.
The RailsGuides Active Record Validations guide includes this slightly misleading or confusing paragraph about the "uniqueness" helper.
This helper validates that the attribute's value is unique right before the object gets saved. It does not create a uniqueness constraint in the database, so it may happen that two different database connections create two records with the same value for a column that you intend to be unique. To avoid that, you must create a unique index on both columns in your database. See the MySQL manual for more details about multiple column indexes.
It clearly says that, in fact, it doesn't guarantee uniqueness. The misleading part is about creating a unique index on "both columns". If you want "username" to be unique, you need to declare a unique constraint on the column "username".
alter table "users"
add constraint constraint_name unique (username);
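One caveat, since the validation is case-insensitive: a plain unique constraint on username still treats 'Alice' and 'alice' as distinct values. A sketch of enforcing case-insensitive uniqueness instead (the index name is arbitrary) is a unique index on the lowercased expression:
CREATE UNIQUE INDEX users_username_lower_unique ON users (lower(username));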
Case-sensitivity
In SQL databases, case-sensitivity is determined by collation. Collation is part of the SQL standards.
In PostgreSQL, you can set collation at the database level, at the column level, at the index level, and at the query level. Values come from the locales the operating system exposes at the time you create a new database cluster using initdb.
On Linux systems, you probably have no case-insensitive collations. That's one reason we have to jump through rather more hoops than people who target SQL Server and Oracle.
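(One of those hoops, sketched under the assumption that the application can tolerate a column type change: the citext contrib extension makes comparisons on the column case-insensitive, which also makes a unique constraint on it case-insensitive.)
CREATE EXTENSION IF NOT EXISTS citext;
ALTER TABLE users ALTER COLUMN username TYPE citext;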
Try running the query in psql using EXPLAIN ANALYZE, so you can make sure Postgres itself is doing fine, because the index and the query appear to be right.
If it is fast in psql, then there is a problem with your Rails code.
This query against a 3k-record table gave this result (on my local dev machine):
app=# explain analyze SELECT id AS one FROM "users" WHERE (LOWER(email) = LOWER('marcus#marcus.marcus') AND "users"."id" != 2000);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on users (cost=4.43..58.06 rows=19 width=4) (actual time=0.101..0.101 rows=0 loops=1)
Recheck Cond: (lower((email)::text) = 'marcus#marcus.marcus'::text)
Filter: (id <> 2000)
-> Bitmap Index Scan on users_lower_idx (cost=0.00..4.43 rows=19 width=0) (actual time=0.097..0.097 rows=0 loops=1)
Index Cond: (lower((email)::text) = 'marcus#marcus.marcus'::text)
Total runtime: 0.144 ms
(6 rows)