Обсуждение: Unexpected planner choice in simple JOIN

Поиск
Список
Период
Сортировка

Unexpected planner choice in simple JOIN

От
Mark Kirkwood
Дата:
I have a 2 table parent child setup (tab0 -< tab1) with a fairly small 
(100 K rows) parent and big child (100 M rows).. Exact setup is included 
below.

If I do a simple range scan on small part of the pk of tab0  the planner 
chooses an index scan (pretty much as expected):

test0=# EXPLAIN ANALYZE SELECT t0.id0 FROM tab0 AS t0 WHERE t0.id0 < 5;
                                                         QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
  Index Only Scan using tab0_pkey on tab0 t0  (cost=0.29..4.38 rows=5 
width=4) (actual time=0.006..0.007 rows=5.00 loops=1)
    Index Cond: (id0 < 5)
    Heap Fetches: 0
    Index Searches: 1
    Buffers: shared hit=3
  Planning:
    Buffers: shared hit=65
  Planning Time: 0.383 ms
  Execution Time: 0.038 ms
(9 rows)

However joining it to tab1 changes this to a parallel seq scan:

test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN tab1 
AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
  Gather  (cost=1097.61..42577.77 rows=5000 width=98) (actual 
time=0.324..7.486 rows=3500.00 loops=1)
    Workers Planned: 1
    Workers Launched: 1
    Buffers: shared hit=1751
    ->  Nested Loop  (cost=97.61..41077.77 rows=2941 width=98) (actual 
time=1.273..3.723 rows=1750.00 loops=2)
          Buffers: shared hit=1751
          ->  Parallel Seq Scan on tab0 t0 (cost=0.00..2375.29 rows=3 
width=4) (actual time=1.247..3.232 rows=2.50 loops=2)
                Filter: (id0 < 5)
                Rows Removed by Filter: 49998
                Buffers: shared hit=1640
          ->  Bitmap Heap Scan on tab1 t1 (cost=97.61..12867.78 
rows=3305 width=98) (actual time=0.031..0.148 rows=700.00 loops=5)
                Recheck Cond: (t0.id0 = id0)
                Heap Blocks: exact=97
                Buffers: shared hit=111
                ->  Bitmap Index Scan on tab1_id0_hash (cost=0.00..96.79 
rows=3305 width=0) (actual time=0.024..0.024 rows=700.00 loops=5)
                      Index Cond: (id0 = t0.id0)
                      Index Searches: 5
                      Buffers: shared hit=14
  Planning:
    Buffers: shared hit=160
  Planning Time: 0.506 ms
  Execution Time: 7.658 ms
(22 rows)

However, disabling seq scan gets back to the index scan again, and what 
looks to be a lower cost overall plan:

test0=# SET enable_seqscan=off;
SET
Time: 0.133 ms
test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN tab1 
AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
  QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
  Nested Loop  (cost=97.91..64508.51 rows=5000 width=98) (actual 
time=0.044..1.093 rows=3500.00 loops=1)
    Buffers: shared hit=113
    ->  Index Only Scan using tab0_pkey on tab0 t0 (cost=0.29..4.38 
rows=5 width=4) (actual time=0.004..0.007 rows=5.00 loops=1)
          Index Cond: (id0 < 5)
          Heap Fetches: 0
          Index Searches: 1
          Buffers: shared hit=3
    ->  Bitmap Heap Scan on tab1 t1  (cost=97.61..12867.78 rows=3305 
width=98) (actual time=0.029..0.153 rows=700.00 loops=5)
          Recheck Cond: (t0.id0 = id0)
          Heap Blocks: exact=97
          Buffers: shared hit=110
          ->  Bitmap Index Scan on tab1_id0_hash (cost=0.00..96.79 
rows=3305 width=0) (actual time=0.022..0.022 rows=700.00 loops=5)
                Index Cond: (id0 = t0.id0)
                Index Searches: 5
                Buffers: shared hit=13
  Planning:
    Buffers: shared hit=9
  Planning Time: 0.187 ms
  Execution Time: 1.262 ms
(19 rows)

This is unexpected (to me anyway). This is 19devel from earlier this 
week, All parameters default. I note that I'm running on NVMe storage, 
and should ideally lower random_page_cost - but this does not materially 
effect the plan for these queries. To be fair, both possible plans are 
pretty fast, I am intrigued that what *looks* like a higher cost plan is 
being chosen!

The Setup

-------------

CREATE TABLE tab0 (
   id0 SERIAL PRIMARY KEY,
   val VARCHAR(100)
);
CREATE TABLE tab1 (
   id1 BIGSERIAL PRIMARY KEY,
   val VARCHAR(100),
   id0 INTEGER REFERENCES tab0(id0)
);

-- Make 100,000 rows

INSERT INTO tab0 (id0, val)
SELECT generate_series(0,99999), 
'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';

-- Make 10,000,000 rows

INSERT INTO tab1 (id0, val)
SELECT generate_series(0,9999999)/100, 
'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';

-- Now make another 90,000,000 rows

INSERT INTO tab1 (id0, val)
SELECT id0, val FROM tab1 LIMIT 10000000;

INSERT INTO tab1 (id0, val)
SELECT id0, val FROM tab1 LIMIT 10000000;

INSERT INTO tab1 (id0, val)
SELECT id0, val FROM tab1 LIMIT 10000000;

INSERT INTO tab1 (id0, val)
SELECT id0, val FROM tab1 LIMIT 10000000;

INSERT INTO tab1 (id0, val)
SELECT id0, val FROM tab1 LIMIT 10000000;

INSERT INTO tab1 (id0, val)
SELECT id0, val FROM tab1 LIMIT 10000000;

INSERT INTO tab1 (id0, val)
SELECT id0, val FROM tab1 LIMIT 10000000;

INSERT INTO tab1 (id0, val)
SELECT id0, val FROM tab1 LIMIT 10000000;

-- Create index on foreign key

INSERT INTO tab1 (id0, val)
SELECT id0, val FROM tab1 LIMIT 10000000;
CREATE INDEX tab1_id0_hash ON tab1 USING HASH (id0);

VACUUM;
ANALYZE;




Re: Unexpected planner choice in simple JOIN

От
Mark Kirkwood
Дата:
This does seem to be related to parallel planning:

test0=# SET max_parallel_workers_per_gather=0;
SET
Time: 0.205 ms
test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN tab1 
AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
  QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
  Nested Loop  (cost=97.91..64508.51 rows=5000 width=98) (actual 
time=0.217..1.193 rows=3500.00 loops=1)
    Buffers: shared hit=113 read=1
    ->  Index Only Scan using tab0_pkey on tab0 t0 (cost=0.29..4.38 
rows=5 width=4) (actual time=0.007..0.010 rows=5.00 loops=1)
          Index Cond: (id0 < 5)
          Heap Fetches: 0
          Index Searches: 1
          Buffers: shared hit=3
    ->  Bitmap Heap Scan on tab1 t1  (cost=97.61..12867.78 rows=3305 
width=98) (actual time=0.069..0.187 rows=700.00 loops=5)
          Recheck Cond: (t0.id0 = id0)
          Heap Blocks: exact=97
          Buffers: shared hit=110 read=1
          ->  Bitmap Index Scan on tab1_id0_hash (cost=0.00..96.79 
rows=3305 width=0) (actual time=0.061..0.061 rows=700.00 loops=5)
                Index Cond: (id0 = t0.id0)
                Index Searches: 5
                Buffers: shared hit=13 read=1
  Planning:
    Buffers: shared hit=216 read=6
  Planning Time: 1.618 ms
  Execution Time: 1.385 ms
(19 rows)

Not clear to me why removing 2 workers makes the seqscan more attractive 
when that part of the plan is 100x more expensive than an index scan....

On 08/01/2026 14:35, Mark Kirkwood wrote:
> I have a 2 table parent child setup (tab0 -< tab1) with a fairly small 
> (100 K rows) parent and big child (100 M rows).. Exact setup is 
> included below.
>
> If I do a simple range scan on small part of the pk of tab0  the 
> planner chooses an index scan (pretty much as expected):
>
> test0=# EXPLAIN ANALYZE SELECT t0.id0 FROM tab0 AS t0 WHERE t0.id0 < 5;
>                                                         QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------------

>
>  Index Only Scan using tab0_pkey on tab0 t0  (cost=0.29..4.38 rows=5 
> width=4) (actual time=0.006..0.007 rows=5.00 loops=1)
>    Index Cond: (id0 < 5)
>    Heap Fetches: 0
>    Index Searches: 1
>    Buffers: shared hit=3
>  Planning:
>    Buffers: shared hit=65
>  Planning Time: 0.383 ms
>  Execution Time: 0.038 ms
> (9 rows)
>
> However joining it to tab1 changes this to a parallel seq scan:
>
> test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN 
> tab1 AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
> QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------------

>
>  Gather  (cost=1097.61..42577.77 rows=5000 width=98) (actual 
> time=0.324..7.486 rows=3500.00 loops=1)
>    Workers Planned: 1
>    Workers Launched: 1
>    Buffers: shared hit=1751
>    ->  Nested Loop  (cost=97.61..41077.77 rows=2941 width=98) (actual 
> time=1.273..3.723 rows=1750.00 loops=2)
>          Buffers: shared hit=1751
>          ->  Parallel Seq Scan on tab0 t0 (cost=0.00..2375.29 rows=3 
> width=4) (actual time=1.247..3.232 rows=2.50 loops=2)
>                Filter: (id0 < 5)
>                Rows Removed by Filter: 49998
>                Buffers: shared hit=1640
>          ->  Bitmap Heap Scan on tab1 t1 (cost=97.61..12867.78 
> rows=3305 width=98) (actual time=0.031..0.148 rows=700.00 loops=5)
>                Recheck Cond: (t0.id0 = id0)
>                Heap Blocks: exact=97
>                Buffers: shared hit=111
>                ->  Bitmap Index Scan on tab1_id0_hash 
> (cost=0.00..96.79 rows=3305 width=0) (actual time=0.024..0.024 
> rows=700.00 loops=5)
>                      Index Cond: (id0 = t0.id0)
>                      Index Searches: 5
>                      Buffers: shared hit=14
>  Planning:
>    Buffers: shared hit=160
>  Planning Time: 0.506 ms
>  Execution Time: 7.658 ms
> (22 rows)
>
> However, disabling seq scan gets back to the index scan again, and 
> what looks to be a lower cost overall plan:
>
> test0=# SET enable_seqscan=off;
> SET
> Time: 0.133 ms
> test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN 
> tab1 AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
>  QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------

>
>  Nested Loop  (cost=97.91..64508.51 rows=5000 width=98) (actual 
> time=0.044..1.093 rows=3500.00 loops=1)
>    Buffers: shared hit=113
>    ->  Index Only Scan using tab0_pkey on tab0 t0 (cost=0.29..4.38 
> rows=5 width=4) (actual time=0.004..0.007 rows=5.00 loops=1)
>          Index Cond: (id0 < 5)
>          Heap Fetches: 0
>          Index Searches: 1
>          Buffers: shared hit=3
>    ->  Bitmap Heap Scan on tab1 t1  (cost=97.61..12867.78 rows=3305 
> width=98) (actual time=0.029..0.153 rows=700.00 loops=5)
>          Recheck Cond: (t0.id0 = id0)
>          Heap Blocks: exact=97
>          Buffers: shared hit=110
>          ->  Bitmap Index Scan on tab1_id0_hash (cost=0.00..96.79 
> rows=3305 width=0) (actual time=0.022..0.022 rows=700.00 loops=5)
>                Index Cond: (id0 = t0.id0)
>                Index Searches: 5
>                Buffers: shared hit=13
>  Planning:
>    Buffers: shared hit=9
>  Planning Time: 0.187 ms
>  Execution Time: 1.262 ms
> (19 rows)
>
>



Re: Unexpected planner choice in simple JOIN

От
David Rowley
Дата:
On Thu, 8 Jan 2026 at 16:34, Mark Kirkwood <mark.kirkwood@gmail.com> wrote:
> This does seem to be related to parallel planning:

Isn't it just a case of hash indexes not allowing parallel scans?

David



Re: Unexpected planner choice in simple JOIN

От
Mark Kirkwood
Дата:
I don't think so - while the case I posted used a hash index on the 
child table, exactly the sane behaviour happens if it is a btree (I 
probably should have mentioned that sorry). Background is I discovered 
this while playing about with hash indexes...which I must say - someone 
has done excellent work on as in this *particular cases* they are 
getting me better query performance!

regards

Mark

On 08/01/2026 16:56, David Rowley wrote:
> On Thu, 8 Jan 2026 at 16:34, Mark Kirkwood <mark.kirkwood@gmail.com> wrote:
>> This does seem to be related to parallel planning:
> Isn't it just a case of hash indexes not allowing parallel scans?
>
> David



Re: Unexpected planner choice in simple JOIN

От
David Rowley
Дата:
On Thu, 8 Jan 2026 at 17:03, Mark Kirkwood <mark.kirkwood@gmail.com> wrote:
>
> I don't think so - while the case I posted used a hash index on the
> child table, exactly the sane behaviour happens if it is a btree (I
> probably should have mentioned that sorry). Background is I discovered
> this while playing about with hash indexes...which I must say - someone
> has done excellent work on as in this *particular cases* they are
> getting me better query performance!

Ok, it seems related to the min_parallel_index_scan_size GUC. If you
zero that, do you get a better plan?

I think the problem is that because the best form of plan for joining
this tiny set of rows to the huge table is a parameterised nested
loop, to parallelise that loop, you need a Parallel node on the outer
side of the Nested Loop. If the index's size is below
min_parallel_index_scan_size then we won't build a partial path for
it.

David



Re: Unexpected planner choice in simple JOIN

От
Mark Kirkwood
Дата:
Good suggestion. The results are...interesting:

test0=# SET min_parallel_index_scan_size =0;
SET
Time: 0.172 ms
test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN tab1 
AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
    QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------
  Gather  (cost=1097.91..40206.84 rows=5000 width=98) (actual 
time=0.362..5.565 rows=3500.00 loops=1)
    Workers Planned: 1
    Workers Launched: 1
    Buffers: shared hit=114
    ->  Nested Loop  (cost=97.91..38706.84 rows=2941 width=98) (actual 
time=0.034..0.479 rows=1750.00 loops=2)
          Buffers: shared hit=114
          ->  Parallel Index Only Scan using tab0_pkey on tab0 t0  
(cost=0.29..4.36 rows=3 width=4) (actual time=0.008..0.009 rows=2.50 
loops=2)
                Index Cond: (id0 < 5)
                Heap Fetches: 0
                Index Searches: 1
                Buffers: shared hit=3
          ->  Bitmap Heap Scan on tab1 t1 (cost=97.61..12867.78 
rows=3305 width=98) (actual time=0.036..0.140 rows=700.00 loops=5)
                Recheck Cond: (t0.id0 = id0)
                Heap Blocks: exact=97
                Buffers: shared hit=111
                ->  Bitmap Index Scan on tab1_id0_hash (cost=0.00..96.79 
rows=3305 width=0) (actual time=0.030..0.030 rows=700.00 loops=5)
                      Index Cond: (id0 = t0.id0)
                      Index Searches: 5
                      Buffers: shared hit=14
  Planning:
    Buffers: shared hit=222
  Planning Time: 0.763 ms
  Execution Time: 5.716 ms
(23 rows)

Time: 7.248 ms
test0=# SET max_parallel_workers_per_gather=0;
SET
Time: 0.131 ms
test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN tab1 
AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
  QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
  Nested Loop  (cost=97.91..64508.51 rows=5000 width=98) (actual 
time=0.044..0.903 rows=3500.00 loops=1)
    Buffers: shared hit=113
    ->  Index Only Scan using tab0_pkey on tab0 t0 (cost=0.29..4.38 
rows=5 width=4) (actual time=0.003..0.005 rows=5.00 loops=1)
          Index Cond: (id0 < 5)
          Heap Fetches: 0
          Index Searches: 1
          Buffers: shared hit=3
    ->  Bitmap Heap Scan on tab1 t1  (cost=97.61..12867.78 rows=3305 
width=98) (actual time=0.028..0.132 rows=700.00 loops=5)
          Recheck Cond: (t0.id0 = id0)
          Heap Blocks: exact=97
          Buffers: shared hit=110
          ->  Bitmap Index Scan on tab1_id0_hash (cost=0.00..96.79 
rows=3305 width=0) (actual time=0.021..0.021 rows=700.00 loops=5)
                Index Cond: (id0 = t0.id0)
                Index Searches: 5
                Buffers: shared hit=13
  Planning:
    Buffers: shared hit=9
  Planning Time: 0.190 ms
  Execution Time: 1.025 ms
(19 rows)

Time: 1.459 ms

However disabling gather workers gets a much better plan. Now I can 
switch the child index to btree if you think that is significant. Best 
wishes

Mark

On 08/01/2026 17:14, David Rowley wrote:
> On Thu, 8 Jan 2026 at 17:03, Mark Kirkwood <mark.kirkwood@gmail.com> wrote:
>> I don't think so - while the case I posted used a hash index on the
>> child table, exactly the sane behaviour happens if it is a btree (I
>> probably should have mentioned that sorry). Background is I discovered
>> this while playing about with hash indexes...which I must say - someone
>> has done excellent work on as in this *particular cases* they are
>> getting me better query performance!
> Ok, it seems related to the min_parallel_index_scan_size GUC. If you
> zero that, do you get a better plan?
>
> I think the problem is that because the best form of plan for joining
> this tiny set of rows to the huge table is a parameterised nested
> loop, to parallelise that loop, you need a Parallel node on the outer
> side of the Nested Loop. If the index's size is below
> min_parallel_index_scan_size then we won't build a partial path for
> it.
>
> David



Re: Unexpected planner choice in simple JOIN

От
Tom Lane
Дата:
Mark Kirkwood <mark.kirkwood@gmail.com> writes:
> Good suggestion. The results are...interesting:

It's clear that the planner is underestimating the costs of using a
parallel worker in your setup.  Try increasing parallel_setup_cost
and/or parallel_tuple_cost to bring things more in line with reality.

            regards, tom lane



Re: Unexpected planner choice in simple JOIN

От
Mark Kirkwood
Дата:
Yeah, that seems to do the trick:

test0=# SET parallel_setup_cost=24000;
SET
Time: 0.150 ms
test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val FROM tab0 AS t0 JOIN tab1 
AS t1 ON (t0.id0 = t1.id0) WHERE t0.id0 < 5;
  QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
  Nested Loop  (cost=97.84..64355.16 rows=5000 width=98) (actual 
time=0.038..0.819 rows=3500.00 loops=1)
    Buffers: shared hit=113
    ->  Index Only Scan using tab0_pkey on tab0 t0 (cost=0.29..4.38 
rows=5 width=4) (actual time=0.003..0.005 rows=5.00 loops=1)
          Index Cond: (id0 < 5)
          Heap Fetches: 0
          Index Searches: 1
          Buffers: shared hit=3
    ->  Bitmap Heap Scan on tab1 t1  (cost=97.55..12837.19 rows=3297 
width=98) (actual time=0.023..0.111 rows=700.00 loops=5)
          Recheck Cond: (t0.id0 = id0)
          Heap Blocks: exact=97
          Buffers: shared hit=110
          ->  Bitmap Index Scan on tab1_id0_hash (cost=0.00..96.73 
rows=3297 width=0) (actual time=0.018..0.018 rows=700.00 loops=5)
                Index Cond: (id0 = t0.id0)
                Index Searches: 5
                Buffers: shared hit=13
  Planning:
    Buffers: shared hit=9
  Planning Time: 0.168 ms
  Execution Time: 0.935 ms
(19 rows)

A point comes to mind - this is not a particularly unusual setup (i.e 
relatively small parent table with big child one), so maybe the defaults 
are not ideal here?

Anyway,, thanks for your help!

regards

Mark

On 08/01/2026 17:34, Tom Lane wrote:
> Mark Kirkwood <mark.kirkwood@gmail.com> writes:
>> Good suggestion. The results are...interesting:
> It's clear that the planner is underestimating the costs of using a
> parallel worker in your setup.  Try increasing parallel_setup_cost
> and/or parallel_tuple_cost to bring things more in line with reality.
>
>             regards, tom lane



Re: Unexpected planner choice in simple JOIN

От
Tom Lane
Дата:
Mark Kirkwood <mark.kirkwood@gmail.com> writes:
> A point comes to mind - this is not a particularly unusual setup (i.e 
> relatively small parent table with big child one), so maybe the defaults 
> are not ideal here?

Very probably.  To my mind, the default costs for parallel query and
JIT are both unduly optimistic and tend to drive the planner to use
those features when you'd be better off without.  The reason there's
not been more argument about them is that the downside of using those
features on a too-small query is bounded, while the upside of using
them on very-big queries isn't.  So nobody's invested the effort to
gather enough evidence to back choosing a different set of defaults.

            regards, tom lane



Re: Unexpected planner choice in simple JOIN

От
Mark Kirkwood
Дата:
Right makes sense - as I noted...the 'wrong' plan is still pretty fast...

On 08/01/2026 17:51, Tom Lane wrote:
> Mark Kirkwood <mark.kirkwood@gmail.com> writes:
>> A point comes to mind - this is not a particularly unusual setup (i.e
>> relatively small parent table with big child one), so maybe the defaults
>> are not ideal here?
> Very probably.  To my mind, the default costs for parallel query and
> JIT are both unduly optimistic and tend to drive the planner to use
> those features when you'd be better off without.  The reason there's
> not been more argument about them is that the downside of using those
> features on a too-small query is bounded, while the upside of using
> them on very-big queries isn't.  So nobody's invested the effort to
> gather enough evidence to back choosing a different set of defaults.
>
>             regards, tom lane