Imagine a table like...
create table study_value (
id serial primary key,
study_id int not null references study (id),
category text not null,
subcategory int not null,
p_value double precision not null
);
I knew it would have 25+ million rows and they needed to be quickly queryable by the parent study as well as optionally by category and subcategory, so I chose to add a BRIN index to it.
create index study_value_idx
on study_value using brin (study_id, category, subcategory);
All data for a given study (1mil+ rows) was inserted in bulk (ordered by category/subcategory) from a buffer via...
copy study_value from stdin with (format csv, header false);
This study data was uploaded sequentially in order of study id, so the insert orderings fully respected the BRIN column order.
The problem I'm seeing is that querying this table on conditions that the BRIN index satisfies, e.g. select count(*) from study_value where study_id = 3;, is performing a full scan and taking 30+ seconds. The size of the BRIN index itself is 48 kB.
If I reindex index study_value_idx, however, queries now take ~100 ms and the index size is over 100 kB.
Everything I've read (in PG docs, on SO, etc.) indicates that one should only need to reindex in very specific situations (e.g. data corruption or indexes failing to build).
I did not need to drop the index before loading data and re-create it afterward, because copying 1 million records into the table only took 10 seconds.
Am I doing something wrong? Is there a better way to do this?
Edit:
I forgot to mention that prior to running reindex, I ran analyze study_value and saw no change.
Yep, my mistake. I needed to VACUUM ANALYZE per @a_horse_with_no_name's comment.
I re-created the table and re-imported data. On fresh load, index size is again 48 kb and query is back to ~30 seconds. I had misread the query plan, though - it does use the index, the actual rows are just wildly different from expected.
Aggregate (cost=231550.86..231550.87 rows=1 width=8) (actual time=32233.141..32233.156 rows=1 loops=1)
-> Bitmap Heap Scan on study_value (cost=6226.26..229546.26 rows=801840 width=0) (actual time=6555.954..27253.035 rows=781580 loops=1)
Recheck Cond: (study_id = 920)
Rows Removed by Index Recheck: 22027434
Heap Blocks: lossy=213169
-> Bitmap Index Scan on study_value_idx (cost=0.00..6025.80 rows=801840 width=0) (actual time=16.345..16.352 rows=2132480 loops=1)
Index Cond: (study_id = 920)
Planning time: 0.941 ms
Execution time: 32233.266 ms
After analyze study_value (3 sec) the idx is still 48 kb and query plan is:
Aggregate (cost=231360.49..231360.50 rows=1 width=8) (actual time=25468.247..25468.259 rows=1 loops=1)
-> Bitmap Heap Scan on study_value (cost=6161.41..229376.81 rows=793472 width=0) (actual time=2740.866..20419.470 rows=781580 loops=1)
Recheck Cond: (study_id = 920)
Rows Removed by Index Recheck: 22027434
Heap Blocks: lossy=213169
-> Bitmap Index Scan on study_value_idx (cost=0.00..5963.04 rows=793472 width=0) (actual time=17.301..17.306 rows=2132480 loops=1)
Index Cond: (study_id = 920)
Planning time: 0.101 ms
Execution time: 25468.389 ms
After vacuum analyze study_value (20 sec) the idx is now 112 kB and query plan is:
Aggregate (cost=231496.34..231496.35 rows=1 width=8) (actual time=10038.873..10038.884 rows=1 loops=1)
-> Bitmap Heap Scan on study_value (cost=6228.78..229501.25 rows=798037 width=0) (actual time=12.303..5133.281 rows=781580 loops=1)
Recheck Cond: (study_id = 920)
Rows Removed by Index Recheck: 17962
Heap Blocks: lossy=7473
-> Bitmap Index Scan on study_value_idx (cost=0.00..6029.27 rows=798037 width=0) (actual time=1.644..1.650 rows=75520 loops=1)
Index Cond: (study_id = 920)
Planning time: 0.511 ms
Execution time: 10038.993 ms
Executing a more detailed query (i.e. one that also filters on category/subcategory) is much faster, maybe ~400 ms.
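The jump in index size from 48 kB to 112 kB after VACUUM is consistent with how BRIN behaves: page ranges added after the index was created stay unsummarized until VACUUM runs or summarization is requested explicitly, and every unsummarized range has to be scanned in full. A minimal sketch of doing this without a full VACUUM or REINDEX, assuming PostgreSQL 9.5 or later for brin_summarize_new_values and 10 or later for autosummarize (the server version is not stated above):

```sql
-- Summarize page ranges that were added since the index was built.
-- Returns the number of ranges that were newly summarized.
select brin_summarize_new_values('study_value_idx');

-- Assumption: PostgreSQL 10 or later. Ask autovacuum to summarize a
-- page range as soon as the following range starts to fill.
alter index study_value_idx set (autosummarize = on);
```

Running the first statement right after each bulk COPY should leave the index in the same state that VACUUM produced here.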
I want to join two big tables (711,147 and 469,519 rows), but I need just subsets of them (44,593 and 28,191 rows). When I create temp tables containing the subsets, the join is very quick (below 1 second). When I use subqueries or views, it takes 5 to 10 minutes.
The problem is that the subset (jahr = 2016) has changed every time I run this query, so with the "fast way" I would have to recreate the temp tables before each use. On top of that, this query is itself the basis of a view, and I don't know when the view will be used.
The fast way with temp tables looks like this:
select rechnung, art into temp rng16 from rng where jahr = 2016;
select rechnung, artikel, menge, epreis into temp fla16 from fla where jahr = 2016;
explain analyse select * from rng16 natural join fla16;
and the result is:
Merge Join (cost=4783.18..27406.15 rows=1500012 width=104) (actual time=544.691..679.280 rows=44593 loops=1)
Merge Cond: (rng16.rechnung = fla16.rechnung)
-> Sort (cost=1681.83..1714.72 rows=13158 width=64) (actual time=222.233..233.251 rows=27630 loops=1)
Sort Key: rng16.rechnung
Sort Method: external merge Disk: 520kB
-> Seq Scan on rng16 (cost=0.00..284.58 rows=13158 width=64) (actual time=0.009..2.880 rows=28191 loops=1)
-> Materialize (cost=3101.35..3215.35 rows=22800 width=72) (actual time=322.449..362.445 rows=44593 loops=1)
-> Sort (cost=3101.35..3158.35 rows=22800 width=72) (actual time=322.444..356.178 rows=44593 loops=1)
Sort Key: fla16.rechnung
Sort Method: external merge Disk: 1248kB
-> Seq Scan on fla16 (cost=0.00..513.00 rows=22800 width=72) (actual time=0.008..7.832 rows=44593 loops=1)
Total runtime: 682.589 ms
but doing it "on the fly" with two subqueries
explain analyse select * from (select rechnung, art from rng where jahr=2016) rng16 natural join (select rechnung, artikel, menge, epreis from fla where jahr = 2016) fla16;
takes ages. Output of explain is:
Nested Loop (cost=0.85..10.98 rows=1 width=21) (actual time=0.036..453240.711 rows=44593 loops=1)
Join Filter: (rng.rechnung = fla.rechnung)
Rows Removed by Join Filter: 1257076670
-> Index Scan using rng_jahr on rng (cost=0.42..5.51 rows=1 width=9) (actual time=0.017..54.372 rows=28191 loops=1)
Index Cond: (jahr = 2016)
-> Index Scan using fla_jahr on fla (cost=0.42..5.46 rows=1 width=19) (actual time=0.020..9.875 rows=44593 loops=28191)
Index Cond: (jahr = 2016)
Total runtime: 453253.579 ms
Instead of using subqueries, try joining tables, like here:
explain analyse
select
rng16.rechnung, rng16.art,
fla.rechnung, fla.artikel, fla.menge, fla.epreis
from
rng rng16
-- note: natural join matches on every column name the two tables share
natural join
fla
where
rng16.jahr = 2016
and fla.jahr = 2016
Is it still a nested loop?
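In the slow plan both index scans are estimated at rows=1 but actually return 28191 and 44593 rows. With an estimate of one row per side a nested loop looks nearly free, which would explain the plan choice. The sketch below assumes the cause is stale or missing statistics, which the question does not confirm; the explicit join on rechnung is also an assumption about which column the natural join is meant to use.

```sql
-- Refresh planner statistics for both base tables.
analyze rng;
analyze fla;

-- Re-check the plan; the row estimates for jahr = 2016 should now be
-- much closer to the actual counts.
explain analyse
select r.rechnung, r.art, f.artikel, f.menge, f.epreis
from rng r
join fla f using (rechnung)
where r.jahr = 2016
  and f.jahr = 2016;

-- Diagnostic only, session-local: discourage nested loops to see
-- whether a merge or hash join would be as fast as the temp-table run.
set enable_nestloop = off;
```

If the estimates stay at rows=1 after analyze, the statistics are not the problem and the diagnostic setting at least shows what the alternative plan costs.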
I have a Rails app with a Postgres database. It has 20 million records. Most of the queries use ILIKE. I have created a trigram index on one of the columns.
Before adding the trigram index, the query execution time was ~200s to ~300s (seconds, not ms).
After creating the trigram index, the query execution time came down to ~30s.
How can I reduce the execution time to milliseconds?
Also are there any good practices/suggestions when dealing with a database this huge?
Thanks in advance :)
Ref : Faster PostgreSQL Searches with Trigrams
Edit: 'Explain Analyze' on one of the queries
EXPLAIN ANALYZE SELECT COUNT(*) FROM "listings" WHERE (categories ilike '%store%');
QUERY PLAN
--------------------------------------------------------------------------
Aggregate (cost=716850.70..716850.71 rows=1 width=0) (actual time=199354.861..199354.861 rows=1 loops=1)
-> Bitmap Heap Scan on listings (cost=3795.12..715827.76 rows=409177 width=0) (actual time=378.374..199005.008 rows=691941 loops=1)
Recheck Cond: ((categories)::text ~~* '%store%'::text)
Rows Removed by Index Recheck: 7302878
Heap Blocks: exact=33686 lossy=448936
-> Bitmap Index Scan on listings_on_categories_idx (cost=0.00..3692.82 rows=409177 width=0) (actual time=367.931..367.931 rows=692449 loops=1)
Index Cond: ((categories)::text ~~* '%store%'::text)
Planning time: 1.345 ms
Execution time: 199355.260 ms
(9 rows)
The index scan itself is fast (under 0.4 seconds), but the trigram index finds more than half a million potential matches. All of these rows have to be checked to see whether they actually match the pattern, which is where the time is spent.
For longer strings or strings with less common letters the performance should be considerably better. Would it be acceptable to impose a lower bound on the length of the search string?
Other than that, maybe the only solution is to use external text search software.
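One more thing is visible in the plan: "Heap Blocks: exact=33686 lossy=448936" means the bitmap did not fit into work_mem and was degraded to page granularity, so every row on those pages had to be rechecked. That would account for most of the 7,302,878 rows removed by the recheck. A minimal sketch of testing this; the 64MB value is a guess, not a recommendation, and needs to be sized to the memory available per connection:

```sql
-- Session-local setting; the value is an assumption for illustration.
set work_mem = '64MB';

explain analyze
select count(*) from listings where categories ilike '%store%';

-- In the new plan, "Heap Blocks" should ideally no longer report lossy
-- pages, and "Rows Removed by Index Recheck" should drop sharply.
reset work_mem;
```

This does not change how many rows the index returns, so it may reduce the runtime substantially without bringing it down to milliseconds.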
I have one simple query, but it is failing with Timeout::Error: execution expired. I am also using Rack::Timeout.
SELECT SUM(total_checks) as totalcheck FROM "orders" WHERE
(orders.order_status_id NOT IN (15, 17)) AND (orders.check_id = 36) AND
(orders.pass_id = '49') AND
(orders.created_at BETWEEN '2016-02-29 22:00:00.000000' AND '2016-03-02 22:00:00.000000') LIMIT 1
The orders table has around 9,762,797 rows. Is there any issue with this query?
This is what I got from explain analyze:
----------
Limit (cost=153.76..153.77 rows=1 width=5) (actual time=14622.323..14622.324 rows=1 loops=1)
-> Aggregate (cost=153.76..153.77 rows=1 width=5) (actual time=14622.322..14622.322 rows=1 loops=1)
-> Index Scan using idx_orders_check_and_pass on orders (cost=0.43..153.76 rows=1 width=5) (actual time=2739.717..14621.649 rows=141 loops=1)
Index Cond: ((check_id = 36) AND (pass_id = 49))
Filter: ((order_status_id <> ALL ('{15,17}'::integer[])) AND (created_at >= '2016-02-29 22:00:00'::timestamp without time zone) AND (created_at <= '2016-03-02 22:00:00'::timestamp without time zone))
Rows Removed by Filter: 42396
Total runtime: 14622.524 ms
(7 rows)
You have quite a big table to run SUM on. I would suggest using some caching mechanism to avoid running this query, because 14 seconds is a lot.
For example, I would suggest creating a new table, total_orders_checks, and storing the total checks there. You would need to update it every time the total_checks value in the orders table changes, which might not suit your app design, but you'll definitely get total_checks out of it much faster.
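Separately from caching, the plan shows that the index only narrows the search by check_id and pass_id; 42,396 rows are then fetched from the table and thrown away by the filter, leaving 141. A sketch of a composite index that also covers created_at follows. The index name is hypothetical, and whether this removes enough heap access to avoid the timeout would need to be measured.

```sql
-- Hypothetical index name. created_at comes last so the range condition
-- can be evaluated inside the index after the two equality conditions.
create index concurrently idx_orders_check_pass_created
    on orders (check_id, pass_id, created_at);

-- The same query with the redundant LIMIT 1 dropped: an aggregate
-- without GROUP BY already returns exactly one row.
select sum(total_checks) as totalcheck
from orders
where check_id = 36
  and pass_id = 49
  and order_status_id not in (15, 17)
  and created_at between '2016-02-29 22:00:00' and '2016-03-02 22:00:00';
```

With this index the order_status_id condition is still applied as a filter, but only to rows inside the two-day window.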
I've got a table, posts, that has about 20 million rows in it. I'm trying to narrow down the posts for a paginated list using the following query:
SELECT "posts".* FROM "posts"
WHERE "posts"."source_id" IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
AND "posts"."deleted_at" IS NULL
ORDER BY external_created_at desc LIMIT 100 OFFSET 0;
(There are about 3.3 million rows that match the source_id in the query)
When I do so, it takes about 60s, and I get the following EXPLAIN ANALYZE (see on depesz):
EXPLAIN ANALYZE SELECT "posts".* FROM "posts" WHERE "posts"."source_id" IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800) AND "posts"."deleted_at" IS NULL ORDER BY external_created_at desc LIMIT 100 OFFSET 0;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2530223.38..2530223.63 rows=100 width=1040) (actual time=66564.583..66564.616 rows=100 loops=1)
-> Sort (cost=2530223.38..2534981.19 rows=1903125 width=1040) (actual time=66564.571..66564.594 rows=100 loops=1)
Sort Key: external_created_at
Sort Method: top-N heapsort Memory: 89kB
-> Bitmap Heap Scan on posts (cost=35499.76..2457487.31 rows=1903125 width=1040) (actual time=279.640..64496.330 rows=1674072 loops=1)
Recheck Cond: ((source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])) AND (deleted_at IS NULL))
Rows Removed by Index Recheck: 4640188
-> Bitmap Index Scan on index_on_posts_partial_source_id_with_order (cost=0.00..35023.98 rows=1903125 width=0) (actual time=275.922..275.922 rows=1674072 loops=1)
Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
Total runtime: 66564.962 ms
(10 rows)
This is the index that it is using:
CREATE INDEX index_on_posts_partial_source_id_with_order ON posts USING btree (source_id) WHERE (deleted_at IS NULL);
It seems that the Recheck Cond is the slowest thing about this query. Everything I see about recheck conditions involves upping the memory that postgres uses because the data is "lossy", but I'm not seeing anything like that in my query plan.
Any recommendations as to how I can speed this up?
It seems like somehow getting rid of the Recheck, or somehow ordering by external_created_at will be my best bets.
Edit: I am using postgres version 9.3.4. Here is the posts table:
CREATE TABLE posts (
id integer NOT NULL,
source_id integer,
message text,
image text,
external_id text,
created_at timestamp without time zone,
updated_at timestamp without time zone,
external text,
like_count integer DEFAULT 0 NOT NULL,
comment_count integer DEFAULT 0 NOT NULL,
external_created_at timestamp without time zone,
deleted_at timestamp without time zone,
poster_name character varying(255),
poster_image text,
poster_url character varying(255),
poster_id text,
"position" integer,
location character varying(255),
description text,
video text,
rejected_at timestamp without time zone,
deleted_by character varying(255),
height integer,
width integer
);
Your query is returning a couple million rows for a paginated list. Think hard about the wisdom of returning data for that many pages. Also, think hard about whether you need all the columns. I doubt that you do.
I built a rough table and inserted about 10 million random(ish) rows into it. My query plan using PostgreSQL 9.4 is roughly similar to yours.
"Limit (cost=138609.10..138609.35 rows=100 width=24) (actual time=1410.012..1410.038 rows=100 loops=1)"
" -> Sort (cost=138609.10..140344.25 rows=694059 width=24) (actual time=1410.010..1410.026 rows=100 loops=1)"
" Sort Key: external_created_at"
" Sort Method: top-N heapsort Memory: 29kB"
" -> Bitmap Heap Scan on posts (cost=12217.47..112082.66 rows=694059 width=24) (actual time=374.393..919.687 rows=3000000 loops=1)"
" Recheck Cond: ((source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])) AND (deleted_at IS NULL))"
" Heap Blocks: exact=16217"
" -> Bitmap Index Scan on index_on_posts_partial_source_id_with_order (cost=0.00..12043.95 rows=694059 width=0) (actual time=370.593..370.593 rows=3000000 loops=1)"
" Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))"
"Planning time: 0.264 ms"
"Execution time: 1410.097 ms"
Adding an index to external_created_at dropped the execution time by a factor of about 470. But I don't have the same distribution of values that you have.
create index on test.posts (external_created_at);
analyze test.posts;
explain analyze
select * from test.posts
where source_id in (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
and deleted_at is null
order by external_created_at desc limit 100 offset 0;
"Limit (cost=0.43..131.43 rows=100 width=24) (actual time=0.219..2.992 rows=100 loops=1)"
" -> Index Scan Backward using posts_external_created_at_idx on posts (cost=0.43..900991.48 rows=687808 width=24) (actual time=0.216..2.976 rows=100 loops=1)"
" Filter: ((deleted_at IS NULL) AND (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])))"
" Rows Removed by Filter: 350"
"Planning time: 0.302 ms"
"Execution time: 3.024 ms"
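Two variations on the index above may be worth testing for the paginated list; both are sketches under stated assumptions rather than measured results. The index name is hypothetical, id is assumed to be unique and external_created_at non-null (neither is declared in the table definition shown), and the timestamp and id literals are placeholders for the last row of the previous page. The source_id list is shortened for readability.

```sql
-- Partial index that matches the deleted_at IS NULL condition and
-- includes id as a tie-breaker for rows with equal timestamps.
create index posts_live_by_external_created_at
    on posts (external_created_at desc, id desc)
    where deleted_at is null;

-- Later pages: continue from the last row already shown instead of
-- using a growing OFFSET, which has to read and discard skipped rows.
select *
from posts
where source_id in (14790, 14787, 32928)
  and deleted_at is null
  and (external_created_at, id) < ('2016-01-01 00:00:00', 12345)  -- placeholders
order by external_created_at desc, id desc
limit 100;
```

As with the plain index, this is only fast when matching rows are common among the most recent posts; for a rare source_id the scan may have to walk a long way before it finds 100 rows.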