Should Postgres COPY FROM be updating BRIN index? - postgresql-9.6

Imagine a table like...
create table study_value (
id serial primary key,
study_id int not null references study (id),
category text not null,
subcategory int not null,
p_value double precision not null
);
I knew it would have 25+ million rows that needed to be quickly queryable by the parent study, and optionally by category and subcategory, so I added a BRIN index to it.
create index study_value_idx
on study_value using brin (study_id, category, subcategory);
All data for a given study (1mil+ rows) was inserted in bulk (ordered by category/subcategory) from a buffer via...
copy study_value from stdin with (format csv, header false);
This study data was uploaded sequentially in order of study id, so the insert orderings fully respected the BRIN column order.
The problem I'm seeing is that querying this table on conditions the BRIN index satisfies, e.g. select count(*) from study_value where study_id = 3;, performs a full scan and takes 30+ seconds. The size of the BRIN index itself is 48 kB.
If I reindex index study_value_idx, however, queries take ~100 ms and the index size is over 100 kB.
Everything I've read (in the PG docs, on SO, etc.) indicates that one should only need to reindex in very specific situations (e.g. data corruption or indexes failing to build).
I did not need to drop the index before loading data and re-create it afterward, because copying 1 million records into the table only took 10 seconds.
Am I doing something wrong? Is there a better way to do this?
Edit:
I forgot to mention that prior to running reindex, I ran analyze study_value and saw no change.

Yep, my mistake. I needed to VACUUM ANALYZE per @a_horse_with_no_name's comment.
I re-created the table and re-imported data. On fresh load, index size is again 48 kb and query is back to ~30 seconds. I had misread the query plan, though - it does use the index, the actual rows are just wildly different from expected.
Aggregate (cost=231550.86..231550.87 rows=1 width=8) (actual time=32233.141..32233.156 rows=1 loops=1)
-> Bitmap Heap Scan on study_value (cost=6226.26..229546.26 rows=801840 width=0) (actual time=6555.954..27253.035 rows=781580 loops=1)
Recheck Cond: (study_id = 920)
Rows Removed by Index Recheck: 22027434
Heap Blocks: lossy=213169
-> Bitmap Index Scan on study_value_idx (cost=0.00..6025.80 rows=801840 width=0) (actual time=16.345..16.352 rows=2132480 loops=1)
Index Cond: (study_id = 920)
Planning time: 0.941 ms
Execution time: 32233.266 ms
After analyze study_value (3 sec) the index is still 48 kB and the query plan is:
Aggregate (cost=231360.49..231360.50 rows=1 width=8) (actual time=25468.247..25468.259 rows=1 loops=1)
-> Bitmap Heap Scan on study_value (cost=6161.41..229376.81 rows=793472 width=0) (actual time=2740.866..20419.470 rows=781580 loops=1)
Recheck Cond: (study_id = 920)
Rows Removed by Index Recheck: 22027434
Heap Blocks: lossy=213169
-> Bitmap Index Scan on study_value_idx (cost=0.00..5963.04 rows=793472 width=0) (actual time=17.301..17.306 rows=2132480 loops=1)
Index Cond: (study_id = 920)
Planning time: 0.101 ms
Execution time: 25468.389 ms
After vacuum analyze study_value (20 sec) the index is now 112 kB and the query plan is:
Aggregate (cost=231496.34..231496.35 rows=1 width=8) (actual time=10038.873..10038.884 rows=1 loops=1)
-> Bitmap Heap Scan on study_value (cost=6228.78..229501.25 rows=798037 width=0) (actual time=12.303..5133.281 rows=781580 loops=1)
Recheck Cond: (study_id = 920)
Rows Removed by Index Recheck: 17962
Heap Blocks: lossy=7473
-> Bitmap Index Scan on study_value_idx (cost=0.00..6029.27 rows=798037 width=0) (actual time=1.644..1.650 rows=75520 loops=1)
Index Cond: (study_id = 920)
Planning time: 0.511 ms
Execution time: 10038.993 ms
Executing a more detailed query (i.e. including category/subcategory) is much faster, around 400 ms.
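The plans above are consistent with how BRIN handles freshly loaded pages: page ranges appended by COPY stay unsummarized until VACUUM (or an explicit summarization call) visits them, and a scan has to treat every unsummarized range as a possible match. A minimal sketch, assuming the table and index from the question and PostgreSQL 9.5 or later, of summarizing right after a bulk load without a full vacuum:

```sql
-- Run after each bulk COPY into study_value.
-- brin_summarize_new_values() summarizes page ranges that have no summary yet
-- and returns the number of range summaries it inserted.
select brin_summarize_new_values('study_value_idx');

-- Refresh planner statistics as well; ANALYZE alone does not
-- summarize BRIN page ranges.
analyze study_value;
```

A plain vacuum study_value; also summarizes the new ranges, so either approach should bring "Heap Blocks: lossy" down to the figure seen after vacuum analyze.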

Related

GreenPlum choosing a bad query plan in join query

Please forgive my poor English
I have two tables in Greenplum (version: PostgreSQL 9.4.20 (Greenplum Database 6.0.0-beta.3)).
The first table is cookie_session:
CREATE TABLE "ods_overall_cookie"."cookie_session" (
"site_cookie" varchar(80) COLLATE "pg_catalog"."default",
"createtime" timestamp(6),
"analyse_domain_cookie" varchar(30) COLLATE "pg_catalog"."default",
"id" int4 NOT NULL,
.... other fields....
)
DISTRIBUTED by(analyse_domain_cookie)
;
CREATE INDEX "index_cookie_session_id" ON "ods_overall_cookie"."cookie_session" USING btree (
"id" "pg_catalog"."int4_ops" ASC NULLS LAST
);
CREATE INDEX "index_analysis_domain_cookie_btree" ON "ods_overall_cookie"."cookie_session" USING btree (
"analyse_domain_cookie" COLLATE "pg_catalog"."default" "pg_catalog"."text_ops" ASC NULLS LAST
);
The other table is ta202202:
CREATE TABLE "ods_log"."ta202202" (
"id" serial8,
"uvcookie" varchar(50) COLLATE "pg_catalog"."default",
.... other fields ...
) distributed by (uvcookie)
;
CREATE INDEX "index_ta202202_id" ON "ods_log"."ta202202" USING btree (
"id" "pg_catalog"."int8_ops" ASC NULLS LAST
);
CREATE INDEX "indev_ta202202_uvcookie" ON "ods_log"."ta202202" USING btree (
"uvcookie" COLLATE "pg_catalog"."default" "pg_catalog"."text_ops" ASC NULLS LAST
);
Each of the two tables has about 100 million rows.
My query is:
select o.id,g.site_cookie
from ods_log.ta202201 o
join ods_overall_cookie.cookie_session as g
on g.analyse_domain_cookie = o.uvcookie
WHERE o.ID BETWEEN 20000000 and 20000077;
This query returns in 0.14 seconds. The EXPLAIN ANALYZE result is:
Gather Motion 24:1 (slice1; segments: 24) (cost=0.00..434.40 rows=1 width=41) (actual time=1.785..4.098 rows=552 loops=1)
-> Nested Loop (cost=0.00..434.40 rows=1 width=41) (actual time=0.225..1.948 rows=276 loops=1)
Join Filter: true
-> Index Scan using index_ta202201_id on ta202201 (cost=0.00..6.02 rows=3 width=25) (actual time=0.100..0.142 rows=8 loops=1)
Index Cond: ((id >= 20000000) AND (id <= 20000077))
-> Index Scan using index_analysis_domain_cookie_btree on cookie_session (cost=0.00..428.38 rows=1 width=33) (actual time=0.013..0.213 rows=34 loops=8)
Index Cond: ((analyse_domain_cookie)::text = (ta202201.uvcookie)::text)
Planning time: 59.930 ms
(slice0) Executor memory: 216K bytes.
(slice1) Executor memory: 156K bytes avg x 24 workers, 156K bytes max (seg0).
(slice2)
Memory used: 128000kB
Optimizer: Pivotal Optimizer (GPORCA) version 3.39.0
Execution time: 26.725 ms
It uses a Nested Loop.
But when I widen the ID range in the WHERE condition, even by just 1, e.g. o.ID BETWEEN 20000000 and 20000078,
the query takes 25 seconds, roughly 200 times longer:
Gather Motion 24:1 (slice1; segments: 24) (cost=0.00..437.02 rows=1 width=41) (actual time=10266.694..23884.316 rows=557 loops=1)
-> Hash Join (cost=0.00..437.02 rows=1 width=41) (actual time=12256.944..23881.566 rows=276 loops=1)
Hash Cond: ((ta202201.uvcookie)::text = (cookie_session.analyse_domain_cookie)::text)
Extra Text: (seg0) Initial batch 0:
(seg0) Wrote 874907K bytes to inner workfile.
(seg0) Wrote 1K bytes to outer workfile.
(seg0) Overflow batches 1..7:
(seg0) Read 1200209K bytes from inner workfile: 171459K avg x 7 nonempty batches, 335761K max.
(seg0) Wrote 766456K bytes to inner workfile: 127743K avg x 6 overflowing batches, 304587K max.
(seg0) Read 1K bytes from outer workfile: 1K avg x 4 nonempty batches, 1K max.
(seg0) Wrote 1K bytes to outer workfile.
(seg0) Secondary Overflow batches 8..32767:
(seg0) Read 2014970K bytes from inner workfile: 9201K avg x 219 nonempty batches, 258871K max.
(seg0) Wrote 1573816K bytes to inner workfile: 12107K avg x 130 overflowing batches, 247277K max.
(seg0) Read 1K bytes from outer workfile.
(seg0) Hash chain length 4.2 avg, 4645100 max, using 3735148 of 59506688 buckets. Skipped 32541 empty batches.
-> Index Scan using index_ta202201_id on ta202201 (cost=0.00..6.02 rows=4 width=25) (actual time=0.380..0.428 rows=8 loops=1)
Index Cond: ((id >= 20000000) AND (id <= 20000078))
-> Hash (cost=431.00..431.00 rows=1 width=51) (actual time=12253.540..12253.540 rows=15647864 loops=1)
-> Seq Scan on cookie_session (cost=0.00..431.00 rows=1 width=51) (actual time=0.058..5175.550 rows=15647865 loops=1)
Planning time: 62.416 ms
(slice0) Executor memory: 184K bytes.
* (slice1) Executor memory: 245659K bytes avg x 24 workers, 375566K bytes max (seg0). Work_mem: 290371K bytes max, 1149907K bytes wanted.
Memory used: 128000kB
Memory wanted: 1150306kB
Optimizer: Pivotal Optimizer (GPORCA) version 3.39.0
Execution time: 23927.425 ms
The query plan changed from Nested Loop to Hash Join, so Greenplum seems to be choosing a bad plan.
I kept adjusting the BETWEEN condition. The results are:
from      to        plan         speed
20000000  20000077  Nested Loop  Fast
20000000  20000078  Hash Join    Slow
20000001  20000078  Nested Loop  Fast
20000001  20000079  Hash Join    Slow
20000002  20000079  Nested Loop  Fast
30000000  30000068  Nested Loop  Fast
30000000  30000069  Hash Join    Slow
30000001  30000069  Nested Loop  Fast
I tried:
set enable_nestloop= on;
set enable_hashjoin = off;
set enable_mergejoin = off;
or rewriting my query like this:
select xxx from a,b where a.id between xxx and xxx and a.uvcookie = b.analyse_domain_cookie
or switching to left join / inner join / full join.
But nothing changed.
So please tell me how I should adjust this. I want to page through the table in ID spans of 1000. For this query, the Nested Loop plan is clearly faster than the Hash Join.
Following kaiwen's answer, I executed
SET optimizer_enable_hashjoin = OFF;
SET optimizer_enable_tablescan = OFF;
and it works. The problem is finally solved.
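Both plans above report "Optimizer: Pivotal Optimizer (GPORCA)", and the settings that finally helped are the optimizer_* ones, which fits the observation that the enable_* settings made no difference. A minimal sketch, assuming the same two settings from the answer and that they should not leak into other queries on the same connection, of scoping them to a single transaction with SET LOCAL (the 1000-wide ID range is only an illustration of the paging described in the question):

```sql
begin;

-- SET LOCAL lasts only until the end of the current transaction.
set local optimizer_enable_hashjoin = off;
set local optimizer_enable_tablescan = off;

select o.id, g.site_cookie
from ods_log.ta202201 o
join ods_overall_cookie.cookie_session as g
  on g.analyse_domain_cookie = o.uvcookie
where o.id between 20000000 and 20000999;

commit;
```

The Seq Scan on cookie_session is estimated at rows=1 but returns 15647865 rows, so it may also be worth running ANALYZE on both tables and checking whether the plan changes without any overrides. That is a guess from the plan output, not something the thread confirms.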

Why is performance of joining tables so much faster than joining subqueries

I want to join two big tables (711,147 and 469,519 rows), but I only need subsets of them (44,593 rows and 28,191 rows). When I create temp tables containing the subsets, the join is very quick (under 1 second). When I use subqueries or views, it takes 5 to 10 minutes.
The problem is that the subset (jahr = 2016) has changed every time I run this query, so with the "fast way" I would have to recreate the temp tables on each use. On top of that, this query is itself the basis of a view, and I don't know when the view will be used.
The fast way with temp tables looks like this:
select rechnung, art into temp rng16 from rng where jahr = 2016;
select rechnung, artikel, menge, epreis into temp fla16 from fla where jahr = 2016;
explain analyse select * from rng16 natural join fla16;
and the result is:
Merge Join (cost=4783.18..27406.15 rows=1500012 width=104) (actual time=544.691..679.280 rows=44593 loops=1)
Merge Cond: (rng16.rechnung = fla16.rechnung)
-> Sort (cost=1681.83..1714.72 rows=13158 width=64) (actual time=222.233..233.251 rows=27630 loops=1)
Sort Key: rng16.rechnung
Sort Method: external merge Disk: 520kB
-> Seq Scan on rng16 (cost=0.00..284.58 rows=13158 width=64) (actual time=0.009..2.880 rows=28191 loops=1)
-> Materialize (cost=3101.35..3215.35 rows=22800 width=72) (actual time=322.449..362.445 rows=44593 loops=1)
-> Sort (cost=3101.35..3158.35 rows=22800 width=72) (actual time=322.444..356.178 rows=44593 loops=1)
Sort Key: fla16.rechnung
Sort Method: external merge Disk: 1248kB
-> Seq Scan on fla16 (cost=0.00..513.00 rows=22800 width=72) (actual time=0.008..7.832 rows=44593 loops=1)
Total runtime: 682.589 ms
but doing it "on the fly" with two subqueries
explain analyse select * from (select rechnung, art from rng where jahr=2016) rng16 natural join (select rechnung, artikel, menge, epreis from fla where jahr = 2016) fla16;
takes ages. The output of explain is:
Nested Loop (cost=0.85..10.98 rows=1 width=21) (actual time=0.036..453240.711 rows=44593 loops=1)
Join Filter: (rng.rechnung = fla.rechnung)
Rows Removed by Join Filter: 1257076670
-> Index Scan using rng_jahr on rng (cost=0.42..5.51 rows=1 width=9) (actual time=0.017..54.372 rows=28191 loops=1)
Index Cond: (jahr = 2016)
-> Index Scan using fla_jahr on fla (cost=0.42..5.46 rows=1 width=19) (actual time=0.020..9.875 rows=44593 loops=28191)
Index Cond: (jahr = 2016)
Total runtime: 453253.579 ms
Instead of using subqueries, try joining the tables directly, like this:
explain analyse
select
rng16.rechnung, rng16.art,
fla.rechnung, fla.artikel, fla.menge, fla.epreis
from
rng rng16
natural join
fla
where
rng16.jahr = 2016
and fla.jahr = 2016;
Is it still a nested loop?
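In the slow plan above, both index scans are estimated at rows=1 while they actually return 28191 and 44593 rows, which is the kind of misestimate that makes a nested loop look cheap. A minimal sketch, assuming the statistics for jahr = 2016 are simply out of date (the question does not confirm this), of refreshing them and re-checking the plan:

```sql
-- Refresh planner statistics for both base tables.
analyze rng;
analyze fla;

-- Re-run the original subquery form and compare the estimated row counts.
explain analyse
select *
from (select rechnung, art from rng where jahr = 2016) rng16
natural join
     (select rechnung, artikel, menge, epreis from fla where jahr = 2016) fla16;
```

If the estimates move close to the actual row counts, the planner has what it needs to consider a merge or hash join instead.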

How to decrease query execution time on a db with 20 million records | Rails, Postgres

I have a Rails app with a Postgres db. It has 20 million records. Most of the queries use ILIKE. I have created a trigram index on one of the columns.
Before adding the trigram index, the query execution time was ~200s to ~300s (seconds, not ms).
After creating the trigram index, the query execution time came down to ~30s.
How can I reduce the execution time to milliseconds?
Also, are there any good practices or suggestions for dealing with a database this large?
Thanks in advance :)
Ref : Faster PostgreSQL Searches with Trigrams
Edit: 'Explain Analyze' on one of the queries
EXPLAIN ANALYZE SELECT COUNT(*) FROM "listings" WHERE (categories ilike '%store%');
QUERY PLAN
--------------------------------------------------------------------------
Aggregate (cost=716850.70..716850.71 rows=1 width=0) (actual time=199354.861..199354.861 rows=1 loops=1)
-> Bitmap Heap Scan on listings (cost=3795.12..715827.76 rows=409177 width=0) (actual time=378.374..199005.008 rows=691941 loops=1)
Recheck Cond: ((categories)::text ~~* '%store%'::text)
Rows Removed by Index Recheck: 7302878
Heap Blocks: exact=33686 lossy=448936
-> Bitmap Index Scan on listings_on_categories_idx (cost=0.00..3692.82 rows=409177 width=0) (actual time=367.931..367.931 rows=692449 loops=1)
Index Cond: ((categories)::text ~~* '%store%'::text)
Planning time: 1.345 ms
Execution time: 199355.260 ms
(9 rows)
The index scan itself is fast (0.3 seconds), but the trigram index finds more than half a million potential matches. All of these rows have to be checked to see whether they actually match the pattern, which is where the time is spent.
For longer strings or strings with less common letters the performance should be considerably better. Is it a solution for you to impose a lower bound on the length of the search string?
Other than that, maybe the only solution is to use external text search software.
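The plan above also reports "Heap Blocks: exact=33686 lossy=448936". A bitmap turns lossy when it does not fit in work_mem, and every row on a lossy page then has to be rechecked, which would account for much of the 7302878 rows removed by the index recheck. A minimal sketch, assuming the session can afford more memory (the 256MB value is an illustrative guess, not a tuned recommendation):

```sql
-- Raise work_mem for this session only, then re-run the plan and compare
-- the "Heap Blocks: exact=... lossy=..." line with the original.
set work_mem = '256MB';

explain analyze
select count(*) from listings where categories ilike '%store%';

-- Return to the server default afterwards.
reset work_mem;
```

If the lossy count drops to zero, the remaining time is the cost of visiting the matching heap pages themselves.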

rails query Timeout::Error: execution expired

I have one simple query, but it fails with Timeout::Error: execution expired. I am also using Rack::Timeout.
SELECT SUM(total_checks) as totalcheck FROM "orders" WHERE
(orders.order_status_id NOT IN (15, 17)) AND (orders.check_id = 36) AND
(orders.pass_id = '49') AND
(orders.created_at BETWEEN '2016-02-29 22:00:00.000000' AND '2016-03-02 22:00:00.000000') LIMIT 1
Also, the orders table has around 9,762,797 rows. Is there any issue with this query?
This is what I got from explain analyze:
----------
Limit (cost=153.76..153.77 rows=1 width=5) (actual time=14622.323..14622.324 rows=1 loops=1)
-> Aggregate (cost=153.76..153.77 rows=1 width=5) (actual time=14622.322..14622.322 rows=1 loops=1)
-> Index Scan using idx_orders_check_and_pass on orders (cost=0.43..153.76 rows=1 width=5) (actual time=2739.717..14621.649 rows=141 loops=1)
Index Cond: ((check_id = 36) AND (pass_id = 49))
Filter: ((order_status_id <> ALL ('{15,17}'::integer[])) AND (created_at >= '2016-02-29 22:00:00'::timestamp without time zone) AND (created_at <= '2016-03-02 22:00:00'::timestamp without time zone))
Rows Removed by Filter: 42396
Total runtime: 14622.524 ms
(7 rows)
You are running SUM over quite a big table. I would suggest using some caching mechanism to avoid running this query, because 14 seconds is a lot.
For example, you could create a new table total_orders_checks and store the total checks there. You would need to update it every time the total_checks value in the orders table changes, which might not suit your app design, but you would definitely get total_checks out of it much faster.
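The summary-table idea above can be sketched as follows. This is a hypothetical layout, not something from the question: the hourly bucket, the column types, and the assumption that the grouped columns are never NULL are all guesses, and keeping the table current (trigger, background job, or application code) is left open.

```sql
-- Hypothetical rollup: one row per (check_id, pass_id, order_status_id, hour).
create table total_orders_checks (
    check_id        integer   not null,
    pass_id         integer   not null,
    order_status_id integer   not null,
    hour_start      timestamp not null,
    total_checks    numeric   not null,
    primary key (check_id, pass_id, order_status_id, hour_start)
);

-- Initial fill from the existing orders.
insert into total_orders_checks
select check_id, pass_id, order_status_id,
       date_trunc('hour', created_at),
       coalesce(sum(total_checks), 0)
from orders
group by check_id, pass_id, order_status_id, date_trunc('hour', created_at);

-- The sum from the question, answered from the rollup. Because buckets are
-- whole hours, orders created exactly at the upper bound (2016-03-02 22:00:00)
-- are excluded here, unlike with the original BETWEEN.
select sum(total_checks) as totalcheck
from total_orders_checks
where order_status_id not in (15, 17)
  and check_id = 36
  and pass_id = 49
  and hour_start >= '2016-02-29 22:00:00'
  and hour_start <  '2016-03-02 22:00:00';
```

Bucketing by hour keeps the rollup small while still allowing ranges that start and end on the hour, as in the question; ranges with finer boundaries would need a finer bucket or a fallback to the orders table.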

Rails/Postgresql query is slow

I have a query that is causing a lot of problems with my Rails app. The query, which runs against 1.3 million rows, is as follows:
SELECT COUNT(*)
FROM "products"
INNER JOIN "categories"
ON "categories"."id" = "products"."category_id"
WHERE "products"."disabled" = 'f'
AND (categories.is_adult = 't'
AND merchant_image_url is NOT NULL
AND advertiser_id in (1,3,8,9,12,16,17,18,24,27,31,32,34,40,44,47,48,53,55,57,61,64
,69,78,79,80,81,85,89,91,95,98,102,110,113,114,119,127,128,130,133,134))
The slow part of the query is:
advertiser_id in (1,3,8,9,12,16,17,18,24,27,31,32,34,40,44,47,48,53,55,57,61,64,
69,78,79,80,81,85,89,91,95,98,102,110,113,114,119,127,128,130,133,134))
There is an index on advertiser_id
This list comes from the user's preferences and is variable. Without it the query runs blazingly fast.
Please ask for any other information if I've not supplied it and I will add it as soon as I can. Thanks in advance!
Update 1
Here is the Query Plan:
Aggregate (cost=198184.75..198184.76 rows=1 width=0) (actual time=11942.818..11942.818 rows=1 loops=1)
-> Hash Join (cost=8065.09..197269.21 rows=366218 width=0) (actual time=110.651..11821.545 rows=349373 loops=1)
Hash Cond: (products.category_id = categories.id)
-> Bitmap Heap Scan on products (cost=8047.13..192215.75 rows=366218 width=4) (actual time=109.655..11470.039 rows=349373 loops=1)
Recheck Cond: (advertiser_id = ANY ('{1,3,8,9,12,16,17,18,24,27,31,32,34,40,44,47,48,53,55,57,61,64,69,78,79,80,81,85,89,91,95,98,102,110,113,114,119,127,128,130,133,134}'::integer[]))
Rows Removed by Index Recheck: 459153
Filter: ((NOT disabled) AND (merchant_image_url IS NOT NULL))
Rows Removed by Filter: 112084
-> Bitmap Index Scan on index_products_on_advertiser_id (cost=0.00..7955.57 rows=465290 width=0) (actual time=106.865..106.865 rows=461457 loops=1)
Index Cond: (advertiser_id = ANY ('{1,3,8,9,12,16,17,18,24,27,31,32,34,40,44,47,48,53,55,57,61,64,69,78,79,80,81,85,89,91,95,98,102,110,113,114,119,127,128,130,133,134}'::integer[]))
-> Hash (cost=11.87..11.87 rows=487 width=4) (actual time=0.944..0.944 rows=487 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 18kB
-> Seq Scan on categories (cost=0.00..11.87 rows=487 width=4) (actual time=0.007..0.616 rows=487 loops=1)
Total runtime: 11943.149 ms
and the Explain
