Uniqueness case-sensitive false causes slow query - ruby-on-rails

I have the following validation:
validates :username, uniqueness: { case_sensitive: false }
This causes the following query to run painfully slowly:
5,510 ms
SELECT ? AS one FROM "users" WHERE (LOWER("users"."username") = LOWER(?) AND "users"."id" != ?) LIMIT ?
Explain plan
Limit  (cost=0.03..4.03 rows=1 width=0)
  ->  Index Scan using idx_users_lower_username on users  (cost=0.03..4.03 rows=1 width=0)
        Index Cond: ?
        Filter: ?
The index was created in my structure.sql using:
CREATE INDEX idx_users_lower_username ON users USING btree (lower((username)::text));
See my question How to create index on LOWER("users"."username") in Rails (using postgres) for more on this.
The query is using the index I created, yet it still takes over 5 seconds. What's wrong here?

There are several different, interrelated things going on here. Exactly how you carry out the changes depends on how you manage changes to your database structure. The most common way is to use Rails migrations, but your linked question suggests you're not doing that. So I'll speak mostly in SQL, and you can adapt that to your method.
Use a sargable WHERE clause
Your WHERE clause isn't sargable. That means it's written in a way that prevents the DBMS from using an index. To create an index PostgreSQL can use here:
create index on "users" (lower("username") varchar_pattern_ops);
Now queries on lowercased usernames can use that index.
explain analyze
select *
from users
where lower(username) = lower('9LCDgRHk7kIXehk6LESDqHBJCt9wmA');
It might appear as if PostgreSQL must lowercase every username in the table, but its query planner is smart enough to see that the expression lower(username) is itself indexed. PostgreSQL uses an index scan.
"Index Scan using users_lower_idx on users (cost=0.43..8.45 rows=1 width=35) (actual time=0.034..0.035 rows=1 loops=1)"
" Index Cond: (lower((username)::text) = 'b0sa9malg7yt1shssajrynqhiddm5d'::text)"
"Total runtime: 0.058 ms"
This table has a million rows of random-ish data; the query returns very, very quickly. It's just about equally fast with the additional condition on "id", but the LIMIT clause slows it down a lot. "Slows it down a lot" doesn't mean it's slow; it still returns in less than 0.1 ms.
Also, the varchar_pattern_ops operator class lets queries that use the LIKE operator use the index.
explain analyze
select *
from users
where lower(username) like 'b%'
"Bitmap Heap Scan on users (cost=1075.12..9875.78 rows=30303 width=35) (actual time=10.217..91.030 rows=31785 loops=1)"
" Filter: (lower((username)::text) ~~ 'b%'::text)"
" -> Bitmap Index Scan on users_lower_idx (cost=0.00..1067.54 rows=31111 width=0) (actual time=8.648..8.648 rows=31785 loops=1)"
" Index Cond: ((lower((username)::text) ~>=~ 'b'::text) AND (lower((username)::text) ~<~ 'c'::text))"
"Total runtime: 93.541 ms"
Only 94 ms to select and return 30k rows from a million.
Queries on very small tables might use a sequential scan even though there's a usable index. I wouldn't worry about that if I were you.
Enforce uniqueness in the database
If you're expecting any bursts of traffic, you should enforce uniqueness in the database. I do this all the time, regardless of any expectations (guesses) about traffic.
The Rails Guides entry on Active Record Validations includes this slightly misleading or confusing paragraph about the "uniqueness" helper.
This helper validates that the attribute's value is unique right before the object gets saved. It does not create a uniqueness constraint in the database, so it may happen that two different database connections create two records with the same value for a column that you intend to be unique. To avoid that, you must create a unique index on both columns in your database. See the MySQL manual for more details about multiple column indexes.
It clearly says that, in fact, it doesn't guarantee uniqueness. The misleading part is about creating a unique index on "both columns". If you want "username" to be unique, you need to declare a unique constraint on the column "username".
alter table "users"
add constraint constraint_name unique (username);
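Note that a plain unique constraint on username is case-sensitive, while the validation above is not. If you want the database-level guarantee to be case-insensitive as well, one hedged option (the index name is made up; adapt it to however you manage structure.sql) is a unique index on the lowercased value:
# Hypothetical migration; on newer Rails append the version, e.g. ActiveRecord::Migration[5.2].
class AddUniqueLowerUsernameIndexToUsers < ActiveRecord::Migration
  def up
    # Enforces case-insensitive uniqueness in the database itself.
    execute "CREATE UNIQUE INDEX users_lower_username_key ON users (lower(username))"
  end

  def down
    execute "DROP INDEX users_lower_username_key"
  end
end
With that in place, a race between two connections inserting 'Alice' and 'alice' surfaces as ActiveRecord::RecordNotUnique instead of a silent duplicate.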
Case-sensitivity
In SQL databases, case-sensitivity is determined by collation. Collation is part of the SQL standards.
In PostgreSQL, you can set collation at the database level, at the column level, at the index level, and at the query level. Values come from the locales the operating system exposes at the time you create a new database cluster using initdb.
On Linux systems, you probably have no case-insensitive collations. That's one reason we have to jump through rather more hoops than people who target SQL Server and Oracle.
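If you want to see what your own cluster offers, a hedged way to peek from a Rails console is to read the pg_collation catalog; the exact list depends on the OS locales present when the cluster was initialized.
# Lists the collation names this PostgreSQL cluster knows about.
ActiveRecord::Base.connection.select_values(
  "SELECT collname FROM pg_collation ORDER BY collname"
)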

Try running the query in psql with EXPLAIN ANALYZE to make sure Postgres itself is behaving, because the index and the query look right.
If it is fast in psql, then the problem is in your Rails code.
This query against a table of about 3k records gave this result on my local dev machine:
app=# explain analyze SELECT id AS one FROM "users" WHERE (LOWER(email) = LOWER('marcus@marcus.marcus') AND "users"."id" != 2000);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on users (cost=4.43..58.06 rows=19 width=4) (actual time=0.101..0.101 rows=0 loops=1)
Recheck Cond: (lower((email)::text) = 'marcus@marcus.marcus'::text)
Filter: (id <> 2000)
-> Bitmap Index Scan on users_lower_idx (cost=0.00..4.43 rows=19 width=0) (actual time=0.097..0.097 rows=0 loops=1)
Index Cond: (lower((email)::text) = 'marcus@marcus.marcus'::text)
Total runtime: 0.144 ms
(6 rows)
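If it really is that fast in psql, a hedged way to see exactly what your Rails process sends (model and attribute names are assumed from the question) is to echo the SQL from a console and trigger the validation by hand:
# In a Rails console: log every statement, then run just the validations.
ActiveRecord::Base.logger = Logger.new($stdout)

user = User.new(username: "Marcus")
user.valid?   # the case-insensitive uniqueness SELECT shows up in the output with its timing
Comparing that logged timing with the psql run tells you whether the 5 seconds is spent in the database or somewhere in the Rails stack.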

Related

Search performance, Active Record, Postgres and Trigram indexes, how to split a query to force postgres to use indexes

I have a table with parts. These parts have a field where all the relevant info is grouped.
I have to perform a search on this field for every word in a search input, using ILIKE with wildcards on both sides. The table has 1.2M rows at the moment.
I have been reading about the best way to index the field for searching, and finally decided to go with GIN trigram indexes. The problem is that the query takes too much time when one of the words is shorter than 3 characters, making the search go from blazingly fast to taking well over 10 seconds in many cases.
Examples and measures.
This query makes use of the trigram index and gets done quickly.
SELECT "parts".* FROM "parts" WHERE (parts.eureka ILIKE '%rodamiento%') AND (parts.eureka ILIKE '%skf%') AND (parts.eureka ILIKE '%asf%')
The analyze output is as follows:
Bitmap Heap Scan on parts (cost=716.03..741.93 rows=13 width=195) (actual time=21.194..21.346 rows=29 loops=1)
Recheck Cond: ((eureka ~~* '%rodamiento%'::text) AND (eureka ~~* '%fag%'::text) AND (eureka ~~* '%asf%'::text))
Heap Blocks: exact=17
-> Bitmap Index Scan on parts_eureka_idx (cost=0.00..716.03 rows=13 width=0) (actual time=21.164..21.164 rows=29 loops=1)
Index Cond: ((eureka ~~* '%rodamiento%'::text) AND (eureka ~~* '%fag%'::text) AND (eureka ~~* '%asf%'::text))
Planning Time: 0.614 ms
Execution Time: 21.467 ms
On the other hand, this other query uses a sequential scan and takes far longer. Notice that the only change is %as% instead of %asf%.
SELECT "parts".* FROM "parts" WHERE (parts.eureka ILIKE '%rodamiento%') AND (parts.eureka ILIKE '%skf%') AND (parts.eureka ILIKE '%as%')
Gather (cost=1000.00..85599.12 rows=87 width=195) (actual time=0.337..3988.485 rows=6548 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on parts (cost=0.00..84590.42 rows=51 width=195) (actual time=0.116..3940.638 rows=3274 loops=2)
Filter: ((eureka ~~* '%rodamiento%'::text) AND (eureka ~~* '%fag%'::text) AND (eureka ~~* '%as%'::text))
Rows Removed by Filter: 637016
Planning Time: 1.003 ms
Execution Time: 3989.197 ms
The code for the search is as follows:
Part.rb
class Part < ApplicationRecord
  acts_as_copy_target

  scope :tipo, ->(tipo) { where tipo: tipo }

  def self.search(params)
    recordset = Part.all
    recordset = recordset.tipo(params[:tipo]) if params[:tipo].present?
    recordset = search_keywords(params[:search], recordset)
    recordset.order(:price_amount1)
  end

  private

  def self.search_keywords(query, recordset)
    keywords = query.to_s.unicode_normalize(:nfc).gsub(/[^[:alnum:]]/, " ").strip.split
    if query
      keywords.each do |keyword|
        recordset = recordset.where("parts.eureka ILIKE :q", q: "%#{keyword}%")
      end
      recordset
    end
  end
end
I was thinking about splitting the query like this:
One query to search the words with length >= 3 to leverage the trigram indexes.
Over the returned recordset, make another query for the rest of the words. I assume that a seq scan over a reduced recordset will take less time than the seq scan shown in the previous analyze output.
Is this a good idea? How can I tell Active Record to act like that? Any other suggestions to improve this?
There is a proposal to fix this problem in PostgreSQL itself, but it has not yet been reviewed and committed, so I don't know whether it will make it into version 13 or not.
You can combine your two steps into one, by forcing postgresql not to think it can use the index for the short query strings:
select * from foo where (x ilike '%long%') and (x||'' ilike '%sh%')
The secret is ||'', which inhibits the index usage on that clause without changing the results.
Now, how to reverse-engineer this into Ruby is not a task for me, but based on the snippet you posted it doesn't seem like it should be hard.
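For what it's worth, a hedged sketch of how the trick could slot into the search_keywords method from the question; the 3-character cutoff mirrors the trigram length, and the rest is illustrative rather than tested against your schema:
def self.search_keywords(query, recordset)
  keywords = query.to_s.unicode_normalize(:nfc).gsub(/[^[:alnum:]]/, " ").strip.split
  keywords.each do |keyword|
    if keyword.length >= 3
      # Long enough for the trigram index to be worth using.
      recordset = recordset.where("parts.eureka ILIKE :q", q: "%#{keyword}%")
    else
      # Appending ||'' hides the column from the planner, so no index is attempted for
      # this clause; it is applied as a plain filter on the rows the other clauses keep.
      recordset = recordset.where("parts.eureka || '' ILIKE :q", q: "%#{keyword}%")
    end
  end
  recordset
end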

Is PGSQL not executing my index because of the ORDER BY clause?

I have a rails query that looks like this:
Person.limit(10).unclaimed_people({})
def unclaimed_people(opts)
  sub_query = where.not(name_id: nil)
  if opts[:q].present?
    query_with_email_or_name(sub_query, opts)
  else
    sub_query.group([:name_id, :effective_name])
             .reorder('MIN(COALESCE(kudo_position,999999999)), lower(effective_name)')
             .select(:name_id)
  end
end
Translating to SQL, the query looks like this:
SELECT "people"."name_id" FROM "people"
WHERE ("people"."name_id" IS NOT NULL)
GROUP BY "people"."name_id", "people"."effective_name"
ORDER BY MIN(COALESCE(kudo_position,999999999)), lower(effective_name) LIMIT 10
Now when I run EXPLAIN on the SQL query, the output shows that an index scan is not being used. Here is the EXPLAIN:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=728151.18..728151.21 rows=10 width=53) (actual time=6333.027..6333.028 rows=10 loops=1)
-> Sort (cost=728151.18..729171.83 rows=408258 width=53) (actual time=6333.024..6333.024 rows=10 loops=1)
Sort Key: (min(COALESCE(kudo_position, 999999999))), (lower(effective_name))
Sort Method: top-N heapsort Memory: 25kB
-> GroupAggregate (cost=676646.88..719328.87 rows=408258 width=53) (actual time=4077.902..6169.151 rows=946982 loops=1)
Group Key: name_id, effective_name
-> Sort (cost=676646.88..686041.57 rows=3757877 width=21) (actual time=4077.846..5106.606 rows=3765261 loops=1)
Sort Key: name_id, effective_name
Sort Method: external merge Disk: 107808kB
-> Seq Scan on people (cost=0.00..112125.78 rows=3757877 width=21) (actual time=0.035..939.682 rows=3765261 loops=1)
Filter: (name_id IS NOT NULL)
Rows Removed by Filter: 317644
Planning time: 0.130 ms
Execution time: 6346.994 ms
Pay attention to the bottom part of the query plan: there is a Seq Scan on people. This is not what I was expecting; in both my development and production databases I have placed an index on the name_id foreign key. Here is the proof from the people table:
"index_people_name_id" btree (name_id) WHERE name_id IS NOT NULL
So my question is: why isn't the index being used? Could it be because of the ORDER BY clause? I read on a page about PostgreSQL performance that it can affect whether an index is used.
In particular, here is the quote from that page.
Indexes are normally not used for ORDER BY or to perform joins. A sequential scan followed by an explicit sort is usually faster than an index scan of a large table. However, LIMIT combined with ORDER BY often will use an index because only a small portion of the table is returned.
As you can see from the query, I am using ORDER BY combined with LIMIT, so I would expect the index to be used. Could this advice be outdated? Is the ORDER BY really affecting the query? What am I missing to get the index to work? I'm not particularly versed in the internals of PostgreSQL, so any help would be appreciated.

Create postgres index for table with inner join in RubyOnRails

I have an app based on RubyOnRails 4.0. I have two models: Stores and Products. There are about 1.5 million products in the system, making it quite slow if I do not use indices properly.
Some basic info
Store has_many Products
Store.affiliate_type_id is used where 1=Affiliated 2=Not affiliated
Products have attributes like "category_connection_id" (integer) and "is_available" (boolean)
In FeededProduct model:
scope :affiliated, -> { joins(:store).where("stores.affiliate_type_id = 1") }
This query takes about 500ms which basically interrupts the website:
FeededProduct.where(:is_available => true).affiliated.where(:category_connection_id => #feeded_product.category_connection_id)
Corresponding postgresql:
FeededProduct Load (481.4ms) SELECT "feeded_products".* FROM "feeded_products" INNER JOIN "stores" ON "stores"."id" = "feeded_products"."store_id" WHERE "feeded_products"."is_available" = 't' AND "feeded_products"."category_connection_id" = 345 AND (stores.affiliate_type_id = 1)
Update: PostgreSQL EXPLAIN:
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Join (cost=477.63..49176.17 rows=21240 width=1084)
Hash Cond: (feeded_products.store_id = stores.id)
-> Bitmap Heap Scan on feeded_products (cost=377.17..48983.06 rows=38580 width=1084)
Recheck Cond: (category_connection_id = 5923)
Filter: is_available
-> Bitmap Index Scan on cc_w_store_index_on_fp (cost=0.00..375.25 rows=38580 width=0)
Index Cond: ((category_connection_id = 5923) AND (is_available = true))
-> Hash (cost=98.87..98.87 rows=452 width=4)
-> Seq Scan on stores (cost=0.00..98.87 rows=452 width=4)
Filter: (affiliate_type_id = 1)
(10 rows)
Question: How can I create an index that will take the inner join into consideration and make this faster?
That depends on the join algorithm that PostgreSQL chooses. Use EXPLAIN on the query to see how PostgreSQL processes the query.
These are the answers depending on the join algorithm:
nested loop join
Here you should create an index on the join condition for the inner relation (the bottom table in the EXPLAIN output). You may further improve things by adding columns that appear in the WHERE clause and significantly improve selectivity (i.e., significantly reduce the number of rows that remain after the condition is applied).
For the outer relation, an index on the columns that appear in the WHERE clause will speed up the query if these conditions filter out most of the rows in the table.
hash join
Here it helps to have indexes on both tables on those columns in the WHERE clause where the conditions filter out most of the rows in the table.
merge join
Here you need indexes on the columns in the merge condition to allow PostgreSQL to use an index scan for sorting. Additionally, you can append columns that appear in the WHERE clause.
Always test with EXPLAIN whether your indexes actually get used. If not, odds are that either they cannot be used or that using them would make the query slower than a sequential scan, e.g. because they do not filter out enough rows.
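For the hash join in the plan above, a hedged sketch of what that advice could look like as a migration (index names are whatever Rails generates; the feeded_products index largely duplicates your existing cc_w_store_index_on_fp and is shown only for completeness):
class AddFilterIndexesForAffiliatedProducts < ActiveRecord::Migration
  def change
    # For WHERE stores.affiliate_type_id = 1; only worthwhile if few stores are affiliated.
    add_index :stores, :affiliate_type_id

    # For WHERE category_connection_id = ? AND is_available = true.
    add_index :feeded_products, [:category_connection_id, :is_available]
  end
end
With only a few hundred rows in stores, though, PostgreSQL may well keep preferring the sequential scan there, which is fine.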

Postgres index has massively different impact on raw SQL vs ActiveRecord query

I have a table with 150k names and tried adding an index on lower(name) to speed up lookups. The index speeds up raw SQL queries by about 100x, but the same query performed using ActiveRecord is unaffected, if not a bit slower.
These are the queries:
NamedEntity.where("lower(name) = ?", "John Doe".downcase).first
vs
conn.execute(
%q{SELECT "named_entities".* FROM "named_entities" WHERE (lower(name) = 'john doe');}
)
I added the index with
CREATE INDEX index_named_entities_on_lower_name ON named_entities USING btree (lower(name));
Here are the benchmarks comparing all cases (50 executions each):
no index, AR: 6.999421
with index, AR: 7.264234
no index, SQL: 5.569600
with index, SQL: 0.045464
The query plans are the exact same for AR and SQL.
Without index:
Seq Scan on named_entities (cost=0.00..2854.31 rows=785 width=130)
Filter: (lower((name)::text) = 'john doe'::text)
And with index:
Bitmap Heap Scan on named_entities (cost=9.30..982.51 rows=785 width=130)
Recheck Cond: (lower((name)::text) = 'john doe'::text)
-> Bitmap Index Scan on index_named_entities_on_lower_name (cost=0.00..9.26 rows=785 width=0)
Index Cond: (lower((name)::text) = 'john doe'::text)
I have no idea how to explain this. The overhead added by ActiveRecord should not be influenced by the index, so the difference in speed between index and no index should be the same for AR and SQL, no?
I found out how to fix the problem: just run ANALYZE named_entities; after creating the index. That makes Postgres update its statistics on all kinds of things, so it can generate a better query plan. (I also found out the Postgres docs are amazing.)
That still doesn't explain the time difference, though, as the EXPLAIN output indicated that both the SQL and AR queries led to the same slow query plan.
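If you want to keep the index creation and the ANALYZE together in your schema history, a hedged migration sketch (names taken from the question; append the Rails version to ActiveRecord::Migration on newer Rails):
class AddLowerNameIndexToNamedEntities < ActiveRecord::Migration
  def up
    execute "CREATE INDEX index_named_entities_on_lower_name ON named_entities (lower(name))"
    execute "ANALYZE named_entities"   # refresh planner statistics for the new expression right away
  end

  def down
    execute "DROP INDEX index_named_entities_on_lower_name"
  end
end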

How to efficiently search for last record matching a condition in Rails and PostgreSQL?

Suppose you want to find the last record entered into the database (highest ID) matching a string: Model.where(:name => 'Joe'). There are 100,000+ records. There are many matches (say thousands).
What is the most efficient way to do this? Does PostgreSQL need to find all the records, or can it just find the last one? Is this a particularly slow query?
Working in Rails 3.0.7, Ruby 1.9.2 and PostgreSQL 8.3.
The important part here is to have a matching index. You can try this small test setup:
Create schema x for testing:
-- DROP SCHEMA x CASCADE; -- to wipe it all for a retest or when done.
CREATE SCHEMA x;
CREATE TABLE x.tbl(id serial, name text);
Insert 10000 random rows:
INSERT INTO x.tbl(name) SELECT 'x' || generate_series(1,10000);
Insert another 10000 rows with repeating names:
INSERT INTO x.tbl(name) SELECT 'y' || generate_series(1,10000)%20;
Delete a random 10% to make it more realistic:
DELETE FROM x.tbl WHERE random() < 0.1;
ANALYZE x.tbl;
Query can look like this:
SELECT *
FROM x.tbl
WHERE name = 'y17'
ORDER BY id DESC
LIMIT 1;
--> Total runtime: 5.535 ms
CREATE INDEX tbl_name_idx on x.tbl(name);
--> Total runtime: 1.228 ms
DROP INDEX x.tbl_name_idx;
CREATE INDEX tbl_name_id_idx on x.tbl(name, id);
--> Total runtime: 0.053 ms
DROP INDEX x.tbl_name_id_idx;
CREATE INDEX tbl_name_id_idx on x.tbl(name, id DESC);
--> Total runtime: 0.048 ms
DROP INDEX x.tbl_name_id_idx;
CREATE INDEX tbl_name_idx on x.tbl(name);
CLUSTER x.tbl using tbl_name_idx;
--> Total runtime: 1.144 ms
DROP INDEX x.tbl_name_idx;
CREATE INDEX tbl_name_id_idx on x.tbl(name, id DESC);
CLUSTER x.tbl using tbl_name_id_idx;
--> Total runtime: 0.047 ms
Conclusion
With a fitting index, the query performs more than 100x faster.
Top performer is a multicolumn index with the filter column first and the sort column last.
Matching sort order in the index helps a little in this case.
Clustering helps with the simple index, because many rows still have to be read from the table, and these can be found in adjacent blocks after clustering. It doesn't help with the multicolumn index in this case, because only one record has to be fetched from the table.
Read more about multicolumn indexes in the manual.
All of these effects grow with the size of the table. 10000 rows of two tiny columns is just a very small test case.
You can put the query together in Rails and the ORM will write the proper SQL:
Model.where(:name=>"Joe").order('created_at DESC').first
This should not result in retrieving all Model records, nor even a table scan.
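To back that query with the kind of multicolumn index the test above settled on (filter column first, sort column last), a hedged migration sketch; the :models table name stands in for whatever Model actually maps to, and created_at is indexed here because that is what this ORM call sorts by:
class AddNameCreatedAtIndexToModels < ActiveRecord::Migration
  def change
    # Filter column first, sort column last, per the conclusion above.
    add_index :models, [:name, :created_at]
  end
end
PostgreSQL can scan a btree backwards, so the plain ascending index also serves the DESC ordering; that is why the DESC variant in the test above only helped a little.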
This is probably the easiest:
SELECT [columns] FROM [table] WHERE [criteria] ORDER BY [id column] DESC LIMIT 1
Note: Indexing is important here. A huge DB will be slow to search no matter how you do it if you're not indexing the right way.

Resources