Identical queries; identical indexes; 50000x different speeds - ruby-on-rails

I have a table with two foreign key fields: buyer_id and supplier_id.
Schema.rb includes the following fields and indexes:
t.bigint "buyer_id"
t.bigint "supplier_id"
t.index ["buyer_id"], name: "index_shipments_on_buyer_id"
t.index ["supplier_id"], name: "index_shipments_on_supplier_id"
# As shown in Postgres itself:
"index_shipments_on_buyer_id" btree (buyer_id)
"index_shipments_on_supplier_id" btree (supplier_id)
I am running the same query on both fields:
SELECT "shipments".* FROM "shipments" WHERE "shipments"."buyer_id" = 198917;
SELECT "shipments".* FROM "shipments" WHERE "shipments"."supplier_id" = 198917;
Here's Postgres' approach to each query and the results:
db=# EXPLAIN ANALYZE SELECT "shipments".* FROM "shipments" WHERE "shipments"."buyer_id" = 198917;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on shipments (cost=1996.94..228726.45 rows=59403 width=1325) (actual time=143.089..4822.412 rows=38685 loops=1)
Recheck Cond: (buyer_id = 198917)
Heap Blocks: exact=22308
-> Bitmap Index Scan on index_shipments_on_buyer_id (cost=0.00..1982.09 rows=59403 width=0) (actual time=136.081..136.081 rows=38685 loops=1)
Index Cond: (buyer_id = 198917)
Planning time: 0.176 ms
Execution time: 4831.741 ms
(7 rows)
db=# EXPLAIN ANALYZE SELECT "shipments".* FROM "shipments" WHERE "shipments"."supplier_id" = 198917;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using index_shipments_on_supplier_id on shipments (cost=0.57..7227.48 rows=1840 width=1325) (actual time=0.055..0.055 rows=0 loops=1)
Index Cond: (supplier_id = 198917)
Planning time: 0.203 ms
Execution time: 0.090 ms
(4 rows)
How can I make the first query as fast as the second?
For learning's sake, why would Postgres use an entirely different strategy, and achieve such different results, for what seem like identical scenarios?
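One way to dig into the difference: the two plans are not really identical scenarios. The buyer_id lookup returns 38685 rows spread over 22308 heap blocks, while the supplier_id lookup returns 0 rows and never touches the heap. A rough sketch of how one might confirm where the time goes is below; it assumes a PostgreSQL version that supports the BUFFERS option, and the statistics query only reads the standard pg_stats view.

```sql
-- Show how many of the heap blocks were served from shared buffers
-- versus read from disk for the slow query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "shipments".* FROM "shipments" WHERE "shipments"."buyer_id" = 198917;

-- Compare what the planner believes about the two columns.
-- A correlation near 0 means matching rows are scattered across the table,
-- so each matching row tends to cost a separate heap block read.
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'shipments'
  AND attname IN ('buyer_id', 'supplier_id');
```

If most of the blocks turn out to be disk reads, the 4.8 seconds is probably I/O for fetching the wide rows (width=1325) rather than a bad choice of plan.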

Related

Should Postgres COPY FROM be updating BRIN index?

Imagine a table like...
create table study_value (
id serial primary key,
study_id int not null references study (id),
category text not null,
subcategory int not null,
p_value double precision not null
);
I knew it would have 25+ million rows and they needed to be quickly queryable by the parent study as well as optionally by category and subcategory, so I chose to add a BRIN to it.
create index study_value_idx
on study_value using brin (study_id, category, subcategory);
All data for a given study (1mil+ rows) was inserted in bulk (ordered by category/subcategory) from a buffer via...
copy study_value from stdin with (format csv, header false);
This study data was uploaded sequentially in order of study id, so the insert orderings fully respected the BRIN column order.
The problem I'm seeing is that querying this table on conditions that the BRIN index satisfies, e.g. select count(*) from study_value where study_id = 3;, performs a full scan and takes 30+ seconds. The size of the BRIN index itself is 48 kB.
If I reindex index study_value_idx, however, queries take ~100 ms and the index size is over 100 kB.
Everything I've read (in PG docs, on SO, etc.) indicates that one should only need to reindex in very specific situations (eg. data corruption or indexes failing to build).
I did not need to drop the index before loading data and re-create it afterward, because copying 1 million records into the table only took 10 seconds.
Am I doing something wrong? Is there a better way to do this?
Edit:
I forgot to mention that prior to running reindex, I ran analyze study_value and saw no change.
Yep, my mistake. I needed to VACUUM ANALYZE per @a_horse_with_no_name's comment.
I re-created the table and re-imported data. On fresh load, index size is again 48 kb and query is back to ~30 seconds. I had misread the query plan, though - it does use the index, the actual rows are just wildly different from expected.
Aggregate (cost=231550.86..231550.87 rows=1 width=8) (actual time=32233.141..32233.156 rows=1 loops=1)
-> Bitmap Heap Scan on study_value (cost=6226.26..229546.26 rows=801840 width=0) (actual time=6555.954..27253.035 rows=781580 loops=1)
Recheck Cond: (study_id = 920)
Rows Removed by Index Recheck: 22027434
Heap Blocks: lossy=213169
-> Bitmap Index Scan on study_value_idx (cost=0.00..6025.80 rows=801840 width=0) (actual time=16.345..16.352 rows=2132480 loops=1)
Index Cond: (study_id = 920)
Planning time: 0.941 ms
Execution time: 32233.266 ms
After analyze study_value (3 sec) the idx is still 48 kb and query plan is:
Aggregate (cost=231360.49..231360.50 rows=1 width=8) (actual time=25468.247..25468.259 rows=1 loops=1)
-> Bitmap Heap Scan on study_value (cost=6161.41..229376.81 rows=793472 width=0) (actual time=2740.866..20419.470 rows=781580 loops=1)
Recheck Cond: (study_id = 920)
Rows Removed by Index Recheck: 22027434
Heap Blocks: lossy=213169
-> Bitmap Index Scan on study_value_idx (cost=0.00..5963.04 rows=793472 width=0) (actual time=17.301..17.306 rows=2132480 loops=1)
Index Cond: (study_id = 920)
Planning time: 0.101 ms
Execution time: 25468.389 ms
After vacuum analyze study_value (20 sec) the idx is now 112 kB and the query plan is:
Aggregate (cost=231496.34..231496.35 rows=1 width=8) (actual time=10038.873..10038.884 rows=1 loops=1)
-> Bitmap Heap Scan on study_value (cost=6228.78..229501.25 rows=798037 width=0) (actual time=12.303..5133.281 rows=781580 loops=1)
Recheck Cond: (study_id = 920)
Rows Removed by Index Recheck: 17962
Heap Blocks: lossy=7473
-> Bitmap Index Scan on study_value_idx (cost=0.00..6029.27 rows=798037 width=0) (actual time=1.644..1.650 rows=75520 loops=1)
Index Cond: (study_id = 920)
Planning time: 0.511 ms
Execution time: 10038.993 ms
Executing a more detailed query (i.e. one that includes category/subcategory) is much faster, maybe ~400 ms.
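The plans above look consistent with how BRIN treats page ranges that were added after the index was built: as far as I understand, they stay unsummarized (and therefore match every query) until a vacuum or an explicit summarization run. That would explain why the lossy heap blocks drop from 213169 to 7473 once vacuum has run. A minimal sketch of summarizing right after a bulk load, without waiting for vacuum, assuming PostgreSQL 9.5 or later:

```sql
-- Run after each bulk COPY into study_value.
-- Summarizes page ranges that have no summary tuple yet and
-- returns the number of ranges that were newly summarized.
SELECT brin_summarize_new_values('study_value_idx');

-- Refresh planner statistics for the newly loaded rows.
ANALYZE study_value;

-- On PostgreSQL 10 and later, the index can instead be asked to
-- queue summarization automatically as new ranges fill up.
ALTER INDEX study_value_idx SET (autosummarize = on);
```

The explicit function call is the more predictable option right after a load, since autosummarize relies on autovacuum workers picking up the request.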

Why is performance of joining tables so much faster than joining subqueries?

I want to join two big tables (711,147 and 469,519 rows), but I need only subsets of them (44,593 rows and 28,191 rows). When I create temp tables containing the subsets, the join is very quick (below 1 second). When I use subqueries or views, it takes 5 to 10 minutes.
The problem is that the subset (jahr = 2016) has changed every time I run this query, so to use the "fast way" I would have to recreate the temp tables first each time. On top of that, this query is itself the basis of a view, and I don't know when the view will be used.
The fast way with temp tables looks like this:
select rechnung, art into temp rng16 from rng where jahr = 2016;
select rechnung, artikel, menge, epreis into temp fla16 from fla where jahr = 2016;
explain analyse select * from rng16 natural join fla16;
and the result is:
Merge Join (cost=4783.18..27406.15 rows=1500012 width=104) (actual time=544.691..679.280 rows=44593 loops=1)
Merge Cond: (rng16.rechnung = fla16.rechnung)
-> Sort (cost=1681.83..1714.72 rows=13158 width=64) (actual time=222.233..233.251 rows=27630 loops=1)
Sort Key: rng16.rechnung
Sort Method: external merge Disk: 520kB
-> Seq Scan on rng16 (cost=0.00..284.58 rows=13158 width=64) (actual time=0.009..2.880 rows=28191 loops=1)
-> Materialize (cost=3101.35..3215.35 rows=22800 width=72) (actual time=322.449..362.445 rows=44593 loops=1)
-> Sort (cost=3101.35..3158.35 rows=22800 width=72) (actual time=322.444..356.178 rows=44593 loops=1)
Sort Key: fla16.rechnung
Sort Method: external merge Disk: 1248kB
-> Seq Scan on fla16 (cost=0.00..513.00 rows=22800 width=72) (actual time=0.008..7.832 rows=44593 loops=1)
Total runtime: 682.589 ms
but doing it "on the fly" with two subqueries
explain analyse select * from (select rechnung, art from rng where jahr=2016) rng16 natural join (select rechnung, artikel, menge, epreis from fla where jahr = 2016) fla16;
takes ages. The output of explain is:
Nested Loop (cost=0.85..10.98 rows=1 width=21) (actual time=0.036..453240.711 rows=44593 loops=1)
Join Filter: (rng.rechnung = fla.rechnung)
Rows Removed by Join Filter: 1257076670
-> Index Scan using rng_jahr on rng (cost=0.42..5.51 rows=1 width=9) (actual time=0.017..54.372 rows=28191 loops=1)
Index Cond: (jahr = 2016)
-> Index Scan using fla_jahr on fla (cost=0.42..5.46 rows=1 width=19) (actual time=0.020..9.875 rows=44593 loops=28191)
Index Cond: (jahr = 2016)
Total runtime: 453253.579 ms
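Instead of using subqueries, try joining the tables directly, like this:
explain analyse
select
rng16.rechnung, rng16.art,
fla.rechnung, fla.artikel, fla.menge, fla.epreis
from
rng rng16
natural join
fla
where
rng16.jahr = 2016
and fla.jahr = 2016
Note that natural join on the full tables matches on every column name the two tables share (at least rechnung and jahr here), not just on rechnung.
Is it still a nested loop?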
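A further observation on the slow plan: both index scans are estimated at rows=1 but actually return 28191 and 44593 rows. With an estimate of one row per side a nested loop looks free to the planner, yet in practice it compares 28191 × 44593 = 1,257,121,263 pairs, of which 44,593 match, which agrees with the 1,257,076,670 rows removed by the join filter. Estimates that far off often point to stale statistics for a year whose data was loaded recently. A sketch of what one might try first, assuming that is the cause:

```sql
-- Refresh planner statistics so the row estimates for jahr = 2016
-- reflect the data that is actually in the tables.
ANALYZE rng;
ANALYZE fla;

-- Re-check the plan for the original subquery form.
EXPLAIN ANALYZE
SELECT *
FROM (SELECT rechnung, art FROM rng WHERE jahr = 2016) rng16
NATURAL JOIN
     (SELECT rechnung, artikel, menge, epreis FROM fla WHERE jahr = 2016) fla16;
```

If the estimates move close to the real counts, the planner will most likely switch to a hash or merge join on its own, and the view needs no temp tables.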

rails query Timeout::Error: execution expired

I have one simple query, but it fails with Timeout::Error: execution expired. I am also using Rack::Timeout.
SELECT SUM(total_checks) as totalcheck FROM "orders" WHERE
(orders.order_status_id NOT IN (15, 17)) AND (orders.check_id = 36) AND
(orders.pass_id = '49') AND (orders.created_at BETWEEN '2016-02-29 22:00:00.000000'
AND '2016-03-02 22:00:00.000000') LIMIT 1
The orders table has around 9,762,797 rows in total. Is there any issue with this query?
This is what I got from EXPLAIN ANALYZE:
----------
Limit (cost=153.76..153.77 rows=1 width=5) (actual time=14622.323..14622.324 rows=1 loops=1)
-> Aggregate (cost=153.76..153.77 rows=1 width=5) (actual time=14622.322..14622.322 rows=1 loops=1)
-> Index Scan using idx_orders_check_and_pass on orders (cost=0.43..153.76 rows=1 width=5) (actual time=2739.717..14621.649 rows=141 loops=1)
Index Cond: ((check_id = 36) AND (pass_id = 49))
Filter: ((order_status_id <> ALL ('{15,17}'::integer[])) AND (created_at >= '2016-02-29 22:00:00'::timestamp without time zone) AND (created_at <= '2016-03-02 22:00:00'::timestamp without time zone))
Rows Removed by Filter: 42396
Total runtime: 14622.524 ms
(7 rows)
You have quite a big table to run SUM on. I would suggest using some caching mechanism to avoid running this query at all, because 14 seconds is a lot.
For example, I would suggest creating a new table, total_orders_checks, and storing the total checks there. You would need to update it every time the total_checks value in the orders table changes, which might not suit your app design, but you'll definitely get total_checks out of it much faster.
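Before adding a cache table, it may be worth looking at the plan itself: the index only narrows on check_id and pass_id, and the Filter step then fetches and throws away 42396 rows to keep 141. An index that also covers created_at would let Postgres skip most of those heap fetches. A sketch under that assumption follows; the index name is made up for illustration, and the gain depends on how the data is distributed.

```sql
-- Hypothetical index name; the columns are the ones used in the WHERE clause.
-- Equality columns come first, the range column (created_at) last.
CREATE INDEX idx_orders_check_pass_created
    ON orders (check_id, pass_id, created_at);

-- Same query as above; LIMIT 1 is dropped because an aggregate
-- without GROUP BY already returns exactly one row.
EXPLAIN ANALYZE
SELECT SUM(total_checks) AS totalcheck
FROM orders
WHERE order_status_id NOT IN (15, 17)
  AND check_id = 36
  AND pass_id = 49
  AND created_at BETWEEN '2016-02-29 22:00:00' AND '2016-03-02 22:00:00';
```

On a busy production table, CREATE INDEX CONCURRENTLY avoids blocking writes while the index builds.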

Optimize Postgres Query with Indexes for large amounts of data

I've got a table, posts, that has about 20 million rows in it. I'm trying to narrow down the posts for a paginated list using the following query:
SELECT "posts".* FROM "posts"
WHERE "posts"."source_id" IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
AND "posts"."deleted_at" IS NULL
ORDER BY external_created_at desc LIMIT 100 OFFSET 0;
(There are about 3.3 million rows that match the source_id in the query)
When I do so, it takes about 60s, and I get the following EXPLAIN ANALYZE (see on depesz):
EXPLAIN ANALYZE SELECT "posts".* FROM "posts" WHERE "posts"."source_id" IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800) AND "posts"."deleted_at" IS NULL ORDER BY external_created_at desc LIMIT 100 OFFSET 0;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2530223.38..2530223.63 rows=100 width=1040) (actual time=66564.583..66564.616 rows=100 loops=1)
-> Sort (cost=2530223.38..2534981.19 rows=1903125 width=1040) (actual time=66564.571..66564.594 rows=100 loops=1)
Sort Key: external_created_at
Sort Method: top-N heapsort Memory: 89kB
-> Bitmap Heap Scan on posts (cost=35499.76..2457487.31 rows=1903125 width=1040) (actual time=279.640..64496.330 rows=1674072 loops=1)
Recheck Cond: ((source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])) AND (deleted_at IS NULL))
Rows Removed by Index Recheck: 4640188
-> Bitmap Index Scan on index_on_posts_partial_source_id_with_order (cost=0.00..35023.98 rows=1903125 width=0) (actual time=275.922..275.922 rows=1674072 loops=1)
Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
Total runtime: 66564.962 ms
(10 rows)
This is the index that it is using:
CREATE INDEX index_on_posts_partial_source_id_with_order ON posts USING btree (source_id) WHERE (deleted_at IS NULL);
It seems that the Recheck Cond is the slowest part of this query. Everything I've read about recheck conditions involves raising the memory Postgres uses because the bitmap has become "lossy", but I'm not seeing anything like that in my query plan.
Any recommendations as to how I can speed this up?
It seems like either getting rid of the Recheck or somehow ordering by external_created_at will be my best bet.
Edit: I am using postgres version 9.3.4. Here is the posts table:
CREATE TABLE posts (
id integer NOT NULL,
source_id integer,
message text,
image text,
external_id text,
created_at timestamp without time zone,
updated_at timestamp without time zone,
external text,
like_count integer DEFAULT 0 NOT NULL,
comment_count integer DEFAULT 0 NOT NULL,
external_created_at timestamp without time zone,
deleted_at timestamp without time zone,
poster_name character varying(255),
poster_image text,
poster_url character varying(255),
poster_id text,
"position" integer,
location character varying(255),
description text,
video text,
rejected_at timestamp without time zone,
deleted_by character varying(255),
height integer,
width integer
);
Your query is returning a couple million rows for a paginated list. Think hard about the wisdom of returning data for that many pages. Also, think hard about whether you need all the columns. I doubt that you do.
I built a rough table and inserted about 10 million random(ish) rows into it. My query plan using PostgreSQL 9.4 is roughly similar to yours.
"Limit (cost=138609.10..138609.35 rows=100 width=24) (actual time=1410.012..1410.038 rows=100 loops=1)"
" -> Sort (cost=138609.10..140344.25 rows=694059 width=24) (actual time=1410.010..1410.026 rows=100 loops=1)"
" Sort Key: external_created_at"
" Sort Method: top-N heapsort Memory: 29kB"
" -> Bitmap Heap Scan on posts (cost=12217.47..112082.66 rows=694059 width=24) (actual time=374.393..919.687 rows=3000000 loops=1)"
" Recheck Cond: ((source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])) AND (deleted_at IS NULL))"
" Heap Blocks: exact=16217"
" -> Bitmap Index Scan on index_on_posts_partial_source_id_with_order (cost=0.00..12043.95 rows=694059 width=0) (actual time=370.593..370.593 rows=3000000 loops=1)"
" Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))"
"Planning time: 0.264 ms"
"Execution time: 1410.097 ms"
Adding an index to external_created_at dropped the execution time by a factor of about 470. But I don't have the same distribution of values that you have.
create index on test.posts (external_created_at);
analyze test.posts;
explain analyze
select * from test.posts
where source_id in (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
and deleted_at is null
order by external_created_at desc limit 100 offset 0;
"Limit (cost=0.43..131.43 rows=100 width=24) (actual time=0.219..2.992 rows=100 loops=1)"
" -> Index Scan Backward using posts_external_created_at_idx on posts (cost=0.43..900991.48 rows=687808 width=24) (actual time=0.216..2.976 rows=100 loops=1)"
" Filter: ((deleted_at IS NULL) AND (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])))"
" Rows Removed by Filter: 350"
"Planning time: 0.302 ms"
"Execution time: 3.024 ms"
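On the Recheck question in the original post: the 9.3 plan does hint at a lossy bitmap, even though it has no "Heap Blocks" line (as far as I know that line only appears from 9.4 on, as in the test plan above). With a plain btree bitmap scan, "Rows Removed by Index Recheck: 4640188" should only be non-zero when the bitmap outgrew work_mem and fell back to tracking whole pages. A minimal way to test that idea in one session is sketched below; the 64MB value is an illustrative guess, not a recommendation.

```sql
-- Check the current setting first.
SHOW work_mem;

-- Raise it for this session only, then re-run the plan.
SET work_mem = '64MB';

EXPLAIN ANALYZE
SELECT "posts".* FROM "posts"
WHERE "posts"."source_id" IN (14790, 14787, 32928, 14796, 14791, 15503, 14789,
                              14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
  AND "posts"."deleted_at" IS NULL
ORDER BY external_created_at DESC
LIMIT 100 OFFSET 0;

-- Go back to the server default afterwards.
RESET work_mem;
```

If "Rows Removed by Index Recheck" drops to zero, the bitmap was lossy before. The query still has to visit about 1.7 million rows to sort them, so the index on external_created_at shown above is likely the bigger win.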

Rails/Postgresql query is slow

I have a query which is causing a lot of problems with my Rails app. The query, which is run against 1.3 million rows, is as follows:
SELECT COUNT(*)
FROM "products"
INNER JOIN "categories"
ON "categories"."id" = "products"."category_id"
WHERE "products"."disabled" = 'f'
AND (categories.is_adult = 't'
AND merchant_image_url is NOT NULL
AND advertiser_id in (1,3,8,9,12,16,17,18,24,27,31,32,34,40,44,47,48,53,55,57,61,64
,69,78,79,80,81,85,89,91,95,98,102,110,113,114,119,127,128,130,133,134))
The slow part of the query is:
advertiser_id in (1,3,8,9,12,16,17,18,24,27,31,32,34,40,44,47,48,53,55,57,61,64,
69,78,79,80,81,85,89,91,95,98,102,110,113,114,119,127,128,130,133,134))
There is an index on advertiser_id
This list comes from the user's preferences, so it varies. Without this condition the query runs blazingly fast.
Please ask for any other information if I've not supplied it and I will add it as soon as I can. Thanks in advance!
Update 1
Here is the Query Plan:
Aggregate (cost=198184.75..198184.76 rows=1 width=0) (actual time=11942.818..11942.818 rows=1 loops=1)
-> Hash Join (cost=8065.09..197269.21 rows=366218 width=0) (actual time=110.651..11821.545 rows=349373 loops=1)
Hash Cond: (products.category_id = categories.id)
-> Bitmap Heap Scan on products (cost=8047.13..192215.75 rows=366218 width=4) (actual time=109.655..11470.039 rows=349373 loops=1)
Recheck Cond: (advertiser_id = ANY ('{1,3,8,9,12,16,17,18,24,27,31,32,34,40,44,47,48,53,55,57,61,64,69,78,79,80,81,85,89,91,95,98,102,110,113,114,119,127,128,130,133,134}'::integer[]))
Rows Removed by Index Recheck: 459153
Filter: ((NOT disabled) AND (merchant_image_url IS NOT NULL))
Rows Removed by Filter: 112084
-> Bitmap Index Scan on index_products_on_advertiser_id (cost=0.00..7955.57 rows=465290 width=0) (actual time=106.865..106.865 rows=461457 loops=1)
Index Cond: (advertiser_id = ANY ('{1,3,8,9,12,16,17,18,24,27,31,32,34,40,44,47,48,53,55,57,61,64,69,78,79,80,81,85,89,91,95,98,102,110,113,114,119,127,128,130,133,134}'::integer[]))
-> Hash (cost=11.87..11.87 rows=487 width=4) (actual time=0.944..0.944 rows=487 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 18kB
-> Seq Scan on categories (cost=0.00..11.87 rows=487 width=4) (actual time=0.007..0.616 rows=487 loops=1)
Total runtime: 11943.149 ms
and the Explain
