Discussion: Unexpected planner choice in simple JOIN
I have a two-table parent/child setup (tab0 -< tab1) with a fairly small parent (100 K rows) and a big child (100 M rows). The exact setup is included below.

If I do a simple range scan on a small part of the pk of tab0, the planner chooses an index scan (pretty much as expected):

test0=# EXPLAIN ANALYZE SELECT t0.id0 FROM tab0 AS t0 WHERE t0.id0 < 5;
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using tab0_pkey on tab0 t0  (cost=0.29..4.38 rows=5 width=4) (actual time=0.006..0.007 rows=5.00 loops=1)
   Index Cond: (id0 < 5)
   Heap Fetches: 0
   Index Searches: 1
   Buffers: shared hit=3
 Planning:
   Buffers: shared hit=65
 Planning Time: 0.383 ms
 Execution Time: 0.038 ms
(9 rows)

However, joining it to tab1 changes this to a parallel seq scan:

test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN tab1 AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1097.61..42577.77 rows=5000 width=98) (actual time=0.324..7.486 rows=3500.00 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Buffers: shared hit=1751
   ->  Nested Loop  (cost=97.61..41077.77 rows=2941 width=98) (actual time=1.273..3.723 rows=1750.00 loops=2)
         Buffers: shared hit=1751
         ->  Parallel Seq Scan on tab0 t0  (cost=0.00..2375.29 rows=3 width=4) (actual time=1.247..3.232 rows=2.50 loops=2)
               Filter: (id0 < 5)
               Rows Removed by Filter: 49998
               Buffers: shared hit=1640
         ->  Bitmap Heap Scan on tab1 t1  (cost=97.61..12867.78 rows=3305 width=98) (actual time=0.031..0.148 rows=700.00 loops=5)
               Recheck Cond: (t0.id0 = id0)
               Heap Blocks: exact=97
               Buffers: shared hit=111
               ->  Bitmap Index Scan on tab1_id0_hash  (cost=0.00..96.79 rows=3305 width=0) (actual time=0.024..0.024 rows=700.00 loops=5)
                     Index Cond: (id0 = t0.id0)
                     Index Searches: 5
                     Buffers: shared hit=14
 Planning:
   Buffers: shared hit=160
 Planning Time: 0.506 ms
 Execution Time: 7.658 ms
(22 rows)

However, disabling seq scans gets back to the index scan again, and what looks to be a lower-cost plan overall:

test0=# SET enable_seqscan=off;
SET
Time: 0.133 ms
test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN tab1 AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=97.91..64508.51 rows=5000 width=98) (actual time=0.044..1.093 rows=3500.00 loops=1)
   Buffers: shared hit=113
   ->  Index Only Scan using tab0_pkey on tab0 t0  (cost=0.29..4.38 rows=5 width=4) (actual time=0.004..0.007 rows=5.00 loops=1)
         Index Cond: (id0 < 5)
         Heap Fetches: 0
         Index Searches: 1
         Buffers: shared hit=3
   ->  Bitmap Heap Scan on tab1 t1  (cost=97.61..12867.78 rows=3305 width=98) (actual time=0.029..0.153 rows=700.00 loops=5)
         Recheck Cond: (t0.id0 = id0)
         Heap Blocks: exact=97
         Buffers: shared hit=110
         ->  Bitmap Index Scan on tab1_id0_hash  (cost=0.00..96.79 rows=3305 width=0) (actual time=0.022..0.022 rows=700.00 loops=5)
               Index Cond: (id0 = t0.id0)
               Index Searches: 5
               Buffers: shared hit=13
 Planning:
   Buffers: shared hit=9
 Planning Time: 0.187 ms
 Execution Time: 1.262 ms
(19 rows)

This is unexpected (to me, anyway). This is 19devel from earlier this week, all parameters default. I note that I'm running on NVMe storage and should ideally lower random_page_cost, but doing so does not materially affect the plan for these queries. To be fair, both possible plans are pretty fast; I'm just intrigued that what *looks* like a higher-cost plan is being chosen!
The Setup
---------

CREATE TABLE tab0 (
    id0 SERIAL PRIMARY KEY,
    val VARCHAR(100)
);

CREATE TABLE tab1 (
    id1 BIGSERIAL PRIMARY KEY,
    val VARCHAR(100),
    id0 INTEGER REFERENCES tab0(id0)
);

-- Make 100,000 rows
INSERT INTO tab0 (id0, val)
SELECT generate_series(0,99999),
       'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';

-- Make 10,000,000 rows
INSERT INTO tab1 (id0, val)
SELECT generate_series(0,9999999)/100,
       'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';

-- Now make another 90,000,000 rows
INSERT INTO tab1 (id0, val) SELECT id0, val FROM tab1 LIMIT 10000000;
INSERT INTO tab1 (id0, val) SELECT id0, val FROM tab1 LIMIT 10000000;
INSERT INTO tab1 (id0, val) SELECT id0, val FROM tab1 LIMIT 10000000;
INSERT INTO tab1 (id0, val) SELECT id0, val FROM tab1 LIMIT 10000000;
INSERT INTO tab1 (id0, val) SELECT id0, val FROM tab1 LIMIT 10000000;
INSERT INTO tab1 (id0, val) SELECT id0, val FROM tab1 LIMIT 10000000;
INSERT INTO tab1 (id0, val) SELECT id0, val FROM tab1 LIMIT 10000000;
INSERT INTO tab1 (id0, val) SELECT id0, val FROM tab1 LIMIT 10000000;
INSERT INTO tab1 (id0, val) SELECT id0, val FROM tab1 LIMIT 10000000;

-- Create index on foreign key
CREATE INDEX tab1_id0_hash ON tab1 USING HASH (id0);

VACUUM;
ANALYZE;
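(A few illustrative sanity checks one might run after loading; these use only the names from the DDL above and the standard pg_size_pretty/pg_total_relation_size functions. The row counts follow from the inserts: 100,000 parent rows, and 10 M child rows doubled up by nine further 10 M-row inserts to reach 100 M.)

```sql
-- Optional sanity checks on the setup above:
SELECT count(*) FROM tab0;   -- expect 100000
SELECT count(*) FROM tab1;   -- expect 100000000

-- Rough sizes, to confirm tab1 really is the "big" side:
SELECT pg_size_pretty(pg_total_relation_size('tab1'));
SELECT pg_size_pretty(pg_relation_size('tab1_id0_hash'));
```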
This does seem to be related to parallel planning:

test0=# SET max_parallel_workers_per_gather=0;
SET
Time: 0.205 ms
test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN tab1 AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=97.91..64508.51 rows=5000 width=98) (actual time=0.217..1.193 rows=3500.00 loops=1)
   Buffers: shared hit=113 read=1
   ->  Index Only Scan using tab0_pkey on tab0 t0  (cost=0.29..4.38 rows=5 width=4) (actual time=0.007..0.010 rows=5.00 loops=1)
         Index Cond: (id0 < 5)
         Heap Fetches: 0
         Index Searches: 1
         Buffers: shared hit=3
   ->  Bitmap Heap Scan on tab1 t1  (cost=97.61..12867.78 rows=3305 width=98) (actual time=0.069..0.187 rows=700.00 loops=5)
         Recheck Cond: (t0.id0 = id0)
         Heap Blocks: exact=97
         Buffers: shared hit=110 read=1
         ->  Bitmap Index Scan on tab1_id0_hash  (cost=0.00..96.79 rows=3305 width=0) (actual time=0.061..0.061 rows=700.00 loops=5)
               Index Cond: (id0 = t0.id0)
               Index Searches: 5
               Buffers: shared hit=13 read=1
 Planning:
   Buffers: shared hit=216 read=6
 Planning Time: 1.618 ms
 Execution Time: 1.385 ms
(19 rows)

It's not clear to me why having the 2 workers available makes the seq scan more attractive, when that part of the plan is 100x more expensive than an index scan....

On 08/01/2026 14:35, Mark Kirkwood wrote:
> I have a 2 table parent child setup (tab0 -< tab1) with a fairly small
> (100 K rows) parent and big child (100 M rows).. Exact setup is
> included below.
On Thu, 8 Jan 2026 at 16:34, Mark Kirkwood <mark.kirkwood@gmail.com> wrote:
> This does seem to be related to parallel planning:

Isn't it just a case of hash indexes not allowing parallel scans?

David
I don't think so - while the case I posted used a hash index on the child table, exactly the same behaviour happens if it is a btree (I probably should have mentioned that, sorry). Background: I discovered this while playing about with hash indexes... which, I must say, someone has done excellent work on, as in these *particular cases* they are getting me better query performance!

regards

Mark

On 08/01/2026 16:56, David Rowley wrote:
> On Thu, 8 Jan 2026 at 16:34, Mark Kirkwood <mark.kirkwood@gmail.com> wrote:
>> This does seem to be related to parallel planning:
> Isn't it just a case of hash indexes not allowing parallel scans?
>
> David
On Thu, 8 Jan 2026 at 17:03, Mark Kirkwood <mark.kirkwood@gmail.com> wrote:
> I don't think so - while the case I posted used a hash index on the
> child table, exactly the sane behaviour happens if it is a btree (I
> probably should have mentioned that sorry).

Ok, it seems related to the min_parallel_index_scan_size GUC. If you zero that, do you get a better plan?

I think the problem is that because the best form of plan for joining this tiny set of rows to the huge table is a parameterised nested loop, to parallelise that loop, you need a Parallel node on the outer side of the Nested Loop. If the index's size is below min_parallel_index_scan_size then we won't build a partial path for it.

David
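(For reference - per the documentation, min_parallel_index_scan_size sets the minimum amount of index data the planner must expect to scan before it will consider a parallel index scan, 512kB by default. A minimal session sketch of the check David suggests:)

```sql
-- Current threshold (default 512kB); the id0 < 5 range touches only a
-- handful of tab0_pkey pages, well under that:
SHOW min_parallel_index_scan_size;

-- Zeroing it lets the planner build a partial (parallel) index path
-- on the outer side of the nested loop:
SET min_parallel_index_scan_size = 0;
```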
Good suggestion. The results are... interesting:

test0=# SET min_parallel_index_scan_size = 0;
SET
Time: 0.172 ms
test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN tab1 AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1097.91..40206.84 rows=5000 width=98) (actual time=0.362..5.565 rows=3500.00 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Buffers: shared hit=114
   ->  Nested Loop  (cost=97.91..38706.84 rows=2941 width=98) (actual time=0.034..0.479 rows=1750.00 loops=2)
         Buffers: shared hit=114
         ->  Parallel Index Only Scan using tab0_pkey on tab0 t0  (cost=0.29..4.36 rows=3 width=4) (actual time=0.008..0.009 rows=2.50 loops=2)
               Index Cond: (id0 < 5)
               Heap Fetches: 0
               Index Searches: 1
               Buffers: shared hit=3
         ->  Bitmap Heap Scan on tab1 t1  (cost=97.61..12867.78 rows=3305 width=98) (actual time=0.036..0.140 rows=700.00 loops=5)
               Recheck Cond: (t0.id0 = id0)
               Heap Blocks: exact=97
               Buffers: shared hit=111
               ->  Bitmap Index Scan on tab1_id0_hash  (cost=0.00..96.79 rows=3305 width=0) (actual time=0.030..0.030 rows=700.00 loops=5)
                     Index Cond: (id0 = t0.id0)
                     Index Searches: 5
                     Buffers: shared hit=14
 Planning:
   Buffers: shared hit=222
 Planning Time: 0.763 ms
 Execution Time: 5.716 ms
(23 rows)

Time: 7.248 ms

test0=# SET max_parallel_workers_per_gather=0;
SET
Time: 0.131 ms
test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN tab1 AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=97.91..64508.51 rows=5000 width=98) (actual time=0.044..0.903 rows=3500.00 loops=1)
   Buffers: shared hit=113
   ->  Index Only Scan using tab0_pkey on tab0 t0  (cost=0.29..4.38 rows=5 width=4) (actual time=0.003..0.005 rows=5.00 loops=1)
         Index Cond: (id0 < 5)
         Heap Fetches: 0
         Index Searches: 1
         Buffers: shared hit=3
   ->  Bitmap Heap Scan on tab1 t1  (cost=97.61..12867.78 rows=3305 width=98) (actual time=0.028..0.132 rows=700.00 loops=5)
         Recheck Cond: (t0.id0 = id0)
         Heap Blocks: exact=97
         Buffers: shared hit=110
         ->  Bitmap Index Scan on tab1_id0_hash  (cost=0.00..96.79 rows=3305 width=0) (actual time=0.021..0.021 rows=700.00 loops=5)
               Index Cond: (id0 = t0.id0)
               Index Searches: 5
               Buffers: shared hit=13
 Planning:
   Buffers: shared hit=9
 Planning Time: 0.190 ms
 Execution Time: 1.025 ms
(19 rows)

Time: 1.459 ms

So zeroing the GUC does get a parallel index scan, but the Gather plan is still chosen; disabling gather workers, however, gets a much better plan. I can switch the child index to a btree if you think that is significant.

Best wishes

Mark

On 08/01/2026 17:14, David Rowley wrote:
> Ok, it seems related to the min_parallel_index_scan_size GUC. If you
> zero that, do you get a better plan?
>
> I think the problem is that because the best form of plan for joining
> this tiny set of rows to the huge table is a parameterised nested
> loop, to parallelise that loop, you need a Parallel node on the outer
> side of the Nested Loop. If the index's size is below
> min_parallel_index_scan_size then we won't build a partial path for
> it.
>
> David
Mark Kirkwood <mark.kirkwood@gmail.com> writes:
> Good suggestion. The results are...interesting:
It's clear that the planner is underestimating the costs of using a
parallel worker in your setup. Try increasing parallel_setup_cost
and/or parallel_tuple_cost to bring things more in line with reality.
regards, tom lane
Yeah, that seems to do the trick:

test0=# SET parallel_setup_cost=24000;
SET
Time: 0.150 ms
test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN tab1 AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=97.84..64355.16 rows=5000 width=98) (actual time=0.038..0.819 rows=3500.00 loops=1)
   Buffers: shared hit=113
   ->  Index Only Scan using tab0_pkey on tab0 t0  (cost=0.29..4.38 rows=5 width=4) (actual time=0.003..0.005 rows=5.00 loops=1)
         Index Cond: (id0 < 5)
         Heap Fetches: 0
         Index Searches: 1
         Buffers: shared hit=3
   ->  Bitmap Heap Scan on tab1 t1  (cost=97.55..12837.19 rows=3297 width=98) (actual time=0.023..0.111 rows=700.00 loops=5)
         Recheck Cond: (t0.id0 = id0)
         Heap Blocks: exact=97
         Buffers: shared hit=110
         ->  Bitmap Index Scan on tab1_id0_hash  (cost=0.00..96.73 rows=3297 width=0) (actual time=0.018..0.018 rows=700.00 loops=5)
               Index Cond: (id0 = t0.id0)
               Index Searches: 5
               Buffers: shared hit=13
 Planning:
   Buffers: shared hit=9
 Planning Time: 0.168 ms
 Execution Time: 0.935 ms
(19 rows)

A point comes to mind: this is not a particularly unusual setup (i.e. a relatively small parent table with a big child one), so maybe the defaults are not ideal here? Anyway, thanks for your help!

regards

Mark

On 08/01/2026 17:34, Tom Lane wrote:
> Mark Kirkwood <mark.kirkwood@gmail.com> writes:
>> Good suggestion. The results are...interesting:
> It's clear that the planner is underestimating the costs of using a
> parallel worker in your setup. Try increasing parallel_setup_cost
> and/or parallel_tuple_cost to bring things more in line with reality.
>
> regards, tom lane
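(The numbers are consistent with the documented Gather costing, i.e. subpath cost + parallel_setup_cost + parallel_tuple_cost per output row, with defaults of 1000 and 0.1 respectively. A sketch checking the figures from the plans earlier in the thread:)

```sql
-- Gather total with default costs: Nested Loop subpath 41077.77 plus
-- setup 1000 plus 0.1 * 5000 estimated rows:
SELECT 41077.77 + 1000 + 0.1 * 5000 AS gather_total;     -- 42577.77, as planned

-- With parallel_setup_cost = 24000, the same path would cost
-- 41077.77 + 24000 + 500 = 65577.77, which now exceeds the serial
-- nested-loop total of 64508.51, so the serial plan wins:
SELECT 41077.77 + 24000 + 0.1 * 5000 AS gather_total_bumped;
```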
Mark Kirkwood <mark.kirkwood@gmail.com> writes:
> A point comes to mind - this is not a particularly unusual setup (i.e
> relatively small parent table with big child one), so maybe the defaults
> are not ideal here?
Very probably. To my mind, the default costs for parallel query and
JIT are both unduly optimistic and tend to drive the planner to use
those features when you'd be better off without. The reason there's
not been more argument about them is that the downside of using those
features on a too-small query is bounded, while the upside of using
them on very-big queries isn't. So nobody's invested the effort to
gather enough evidence to back choosing a different set of defaults.
regards, tom lane
Right, makes sense - as I noted, the 'wrong' plan is still pretty fast...

On 08/01/2026 17:51, Tom Lane wrote:
> Mark Kirkwood <mark.kirkwood@gmail.com> writes:
>> A point comes to mind - this is not a particularly unusual setup (i.e
>> relatively small parent table with big child one), so maybe the defaults
>> are not ideal here?
> Very probably. To my mind, the default costs for parallel query and
> JIT are both unduly optimistic and tend to drive the planner to use
> those features when you'd be better off without. The reason there's
> not been more argument about them is that the downside of using those
> features on a too-small query is bounded, while the upside of using
> them on very-big queries isn't. So nobody's invested the effort to
> gather enough evidence to back choosing a different set of defaults.
>
> regards, tom lane