Discussion: apply_scanjoin_target_to_paths and partitionwise join
Hi All,

Per below code and comment in apply_scanjoin_target_to_paths(), the
function zaps all the paths of a partitioned relation.

/*
 * If the rel is partitioned, we want to drop its existing paths and
 * generate new ones. This function would still be correct if we kept the
 * existing paths: we'd modify them to generate the correct target above
 * the partitioning Append, and then they'd compete on cost with paths
 * generating the target below the Append
 ... snip ...
 */
if (rel_is_partitioned)
    rel->pathlist = NIL;

Later the function adjusts the targets of paths in child relations and
constructs Append paths from them. That works for simple partitioned
relations but not for a join between partitioned relations. When
enable_partitionwise_join is true, the joinrel representing a join
between partitioned relations may have join paths joining Append paths,
as well as Append paths containing child join paths. Once we zap the
pathlist, the only paths that can be computed again are the Append
paths. If the optimal path, before applying the new target, was a join
of Append paths, it will be lost forever. This will result in choosing
a suboptimal Append path.

We have one such query in our regression set:

SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c
FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c))
     FULL JOIN plt3_adv t3 ON (t1.c = t3.c)
WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4
ORDER BY t1.c, t1.a, t2.a, t3.a;

For this query, the cheapest Append-of-Joins path has cost 24.97..25.57
and the cheapest Join-of-Appends path has cost 21.29..21.81. The latter
should be chosen even though enable_partitionwise_join is ON. But this
function chooses the first.

The solution is to zap the pathlists only for simple partitioned
relations, as in the attached patch.

With this patch the above query no longer chooses the partitionwise
join path and the partition_join test fails. That's expected. But we
need to replace that query with some query which uses partitionwise
join while maintaining the conditions of the test as explained in the
comment above that query. I have tried a few variations but without
success. Suggestions welcome.

The problem is reproducible on PG 15. The patch is based on the
15_STABLE branch, but the problem exists in recent branches as well.
--
Best Wishes,
Ashutosh Bapat
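The attachment itself is not reproduced in the archive. As a rough
sketch of the described change ("zap the pathlists only for simple
partitioned relations"), the test in apply_scanjoin_target_to_paths()
could be narrowed along these lines; using IS_JOIN_REL() to exclude
partitioned joinrels is an assumption here, not necessarily the
attached patch's exact form:

    /*
     * Sketch only: keep the pathlist of partitioned joinrels so that
     * existing non-partitionwise join paths survive and can compete on
     * cost with the partitionwise Append paths rebuilt later.  Simple
     * partitioned relations are still zapped, as before.
     */
    if (rel_is_partitioned && !IS_JOIN_REL(rel))
        rel->pathlist = NIL;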
On Thu, Apr 11, 2024 at 12:07 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> Hi All,
>
> Per below code and comment in apply_scanjoin_target_to_paths(), the
> function zaps all the paths of a partitioned relation.
> [...]
> I have tried a few variations but without success. Suggestions welcome.

Here's patch with [...]

Adding to the next commitfest but better to consider this for the next
set of minor releases.
Hi Ashutosh & hackers,
On Mon, Apr 15, 2024 at 9:00 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> Here's patch with
>
[..]
> Adding to the next commitfest but better to consider this for the next set of minor releases.
1. The patch does not pass cfbot -
https://cirrus-ci.com/task/5486258451906560 on master due to test
failure "not ok 206 + partition_join"
2. Without the patch applied, the result of the meson test on master
was clean (no failures, so master is fine). After applying the patch
there were some expected hunk offsets (as the patch was created for
15_STABLE):
patching file src/backend/optimizer/plan/planner.c
Hunk #1 succeeded at 7567 (offset 468 lines).
Hunk #2 succeeded at 7593 (offset 468 lines).
patching file src/test/regress/expected/partition_join.out
Hunk #1 succeeded at 4777 (offset 56 lines).
Hunk #2 succeeded at 4867 (offset 56 lines).
patching file src/test/regress/sql/partition_join.sql
Hunk #1 succeeded at 1136 (offset 1 line).
3. Without the patch there is a performance regression/bug on master
(the cost is higher with enable_partitionwise_join=on than without it):
data preparation:

-- Test the process_outer_partition() code path
CREATE TABLE plt1_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt1_adv_p1 PARTITION OF plt1_adv FOR VALUES IN ('0000', '0001', '0002');
CREATE TABLE plt1_adv_p2 PARTITION OF plt1_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt1_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i;
ANALYZE plt1_adv;

CREATE TABLE plt2_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt2_adv_p1 PARTITION OF plt2_adv FOR VALUES IN ('0002');
CREATE TABLE plt2_adv_p2 PARTITION OF plt2_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt2_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i WHERE i % 5 IN (2, 3, 4);
ANALYZE plt2_adv;

CREATE TABLE plt3_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt3_adv_p1 PARTITION OF plt3_adv FOR VALUES IN ('0001');
CREATE TABLE plt3_adv_p2 PARTITION OF plt3_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt3_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i WHERE i % 5 IN (1, 3, 4);
ANALYZE plt3_adv;
off:
EXPLAIN SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1
LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c
= t3.c) WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a, t3.a;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=22.02..22.58 rows=223 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join  (cost=4.83..13.33 rows=223 width=27)
[..]
with enable_partitionwise_join=ON (see the jump from cost 22.02 -> 27.65):
EXPLAIN SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1
LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c
= t3.c) WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a, t3.a;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=27.65..28.37 rows=289 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Append  (cost=2.23..15.83 rows=289 width=27)
         ->  Hash Full Join  (cost=2.23..4.81 rows=41 width=27)
[..]
         ->  Hash Full Join  (cost=2.45..9.57 rows=248 width=27)
[..]
However, with the patch applied, the plan with the minimal cost ("22")
is always chosen:
explain SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN
plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE
coalesce(t1.a, 0 ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=22.02..22.58 rows=223 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join  (cost=4.83..13.33 rows=223 width=27)
[..]
set enable_partitionwise_join to on;
explain SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN
plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE
coalesce(t1.a, 0 ) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=22.02..22.58 rows=223 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join  (cost=4.83..13.33 rows=223 width=27)
[..]
With the patch applied, the cost always stays the minimal one of the
available plans, with the toggle on or off. We cannot provide a
reproducer for the real performance regression, but for the affected
customer the query took 530+ s (with enable_partitionwise_join=on) and
without that GUC it was ~23 s.
4. meson test ends up with failures like below:
  4/290 postgresql:regress / regress/regress                 ERROR  32.67s
  6/290 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade    ERROR  56.96s
 35/290 postgresql:recovery / recovery/027_stream_regress    ERROR  40.20s
(all due to "regression tests pass" failures)
partition_join.sql is failing for test 206, i.e. for this:
-- partitionwise join with fractional paths
CREATE TABLE fract_t (id BIGINT, PRIMARY KEY (id)) PARTITION BY RANGE (id);
CREATE TABLE fract_t0 PARTITION OF fract_t FOR VALUES FROM ('0') TO ('1000');
CREATE TABLE fract_t1 PARTITION OF fract_t FOR VALUES FROM ('1000') TO ('2000');
-- insert data
INSERT INTO fract_t (id) (SELECT generate_series(0, 1999));
ANALYZE fract_t;
-- verify plan; nested index only scans
SET max_parallel_workers_per_gather = 0;
SET enable_partitionwise_join = on;
The test suite was expecting the below with enable_partitionwise_join = on:
EXPLAIN (COSTS OFF)
SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER
BY x.id ASC LIMIT 10;
                               QUERY PLAN
-----------------------------------------------------------------------
 Limit
   ->  Merge Append
         Sort Key: x.id
         ->  Merge Left Join
               Merge Cond: (x_1.id = y_1.id)
               ->  Index Only Scan using fract_t0_pkey on fract_t0 x_1
               ->  Index Only Scan using fract_t0_pkey on fract_t0 y_1
         ->  Merge Left Join
               Merge Cond: (x_2.id = y_2.id)
               ->  Index Only Scan using fract_t1_pkey on fract_t1 x_2
               ->  Index Only Scan using fract_t1_pkey on fract_t1 y_2
but with the patch applied it actually gets this (here with costs):
EXPLAIN (COSTS) SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y
USING (id) ORDER BY x.id ASC LIMIT 10;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.10..2.21 rows=10 width=16)
   ->  Merge Left Join  (cost=1.10..223.10 rows=2000 width=16)
         Merge Cond: (x.id = y.id)
         ->  Append  (cost=0.55..96.55 rows=2000 width=8)
[..]
         ->  Append  (cost=0.55..96.55 rows=2000 width=8)
[..]
If you run it without the patch and again with enable_partitionwise_join=on:
EXPLAIN SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING
(id) ORDER BY x.id ASC LIMIT 10;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.11..2.22 rows=10 width=16)
   ->  Merge Append  (cost=1.11..223.11 rows=2000 width=16)
         Sort Key: x.id
         ->  Merge Left Join  (cost=0.55..101.55 rows=1000 width=16)
[..]
         ->  Merge Left Join  (cost=0.55..101.55 rows=1000 width=16)
[..]
So with the patch that SQL does not use a partitionwise join, as it
finds it more optimal to stick to a plan with cost "1.10..2.21" instead
of "1.11..2.22" (with partitionwise join) - nitpicking, but still a
failure technically. Perhaps the test could even be removed? (It's
pretty close to noise.)

-J.
On Mon, May 6, 2024 at 4:26 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> 1. The patch does not pass cfbot -
> https://cirrus-ci.com/task/5486258451906560 on master due to test
> failure "not ok 206 + partition_join"

So I need to create a patch for master first. I thought CFBot somehow
knew that the patch was created for PG 15. :)

> 4. meson test ends up with failures like below:
> [...]
> So with the patch that SQL does not use a partitionwise join, as it
> finds it more optimal to stick to a plan with cost "1.10..2.21" instead
> of "1.11..2.22" (with partitionwise join) - nitpicking, but still a
> failure technically. Perhaps the test could even be removed? (It's
> pretty close to noise.)
Hi Ashutosh,
thanks for bringing this to my attention. I'll first share a few
thoughts about the change and respond regarding the test below.
I clearly understand your intention with this patch. It's an issue I run
into from time to time.
I did some testing with some benchmark sets back with pg 14. I did the
following: I planned with and without the partitionwise join GUC
(explain) and took the one with the lower cost to execute the query.
Interestingly, even discounting the overhead and additional planning
time, the option with the lower cost turned out to be slower on our
benchmark set back then. The median query with disabled GUC was quicker,
but on average that was not the case. The observation is one I'd
generally describe as "The more options you consider, the more ways we
have to be horribly wrong. More options for the planner are a great way
to uncover its various shortcomings."
That might be specific to the benchmark I was working with at the time.
But that made me drop the issue back then. That is ofc no valid reason
not to go in the direction of making the planner consider more
options. :)
Maybe we can discuss that in person next week?
On 2024-05-22 07:57, Ashutosh Bapat wrote:
> On Mon, May 6, 2024 at 6:28 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
>>> 4. meson test ends up with failures like below:
>>>
>>>   4/290 postgresql:regress / regress/regress                 ERROR  32.67s
>>>   6/290 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade    ERROR  56.96s
>>>  35/290 postgresql:recovery / recovery/027_stream_regress    ERROR  40.20s
>>>
>>> (all due to "regression tests pass" failures)
>>> [...]
>
>>> So with the patch that SQL does not use partitionwise join as it
>>> finds
>>> it more optimal to stick to a plan with cost of "1.10..2.21"
>>> instead
>>> of "1.11..2.22" (w/ partition_join), nitpicking but still a
>>> failure
>>> technically. Perhaps it could be even removed? (it's pretty close
>>> to
>>> noise?).
>
> The test was added by 6b94e7a6da2f1c6df1a42efe64251f32a444d174 and
> later modified by 3c569049b7b502bb4952483d19ce622ff0af5fd6. The
> modification just avoided eliminating the join, so that change can be
> ignored. 6b94e7a6da2f1c6df1a42efe64251f32a444d174 added the tests to
> test fractional paths being considered when creating ordered append
> paths. Reading the commit message, I was expecting a test which did
> not use a join as well and also which used inheritance. But it seems
> that the queries added by that commit, test all the required scenarios
> and hence we see two queries involving join between partitioned
> tables. As the comment there says the intention is to verify index
> only scans and not exactly partitionwise join. So just fixing the
> expected output of one query looks fine. The other query will test
> partitionwise join and fractional paths anyway. I am including Tomas,
> Arne and Zhihong, who worked on the first commit, to comment on
> expected output changes.
The test was put there to make sure a fractional join is considered in
the case that a partitionwise join is considered, because that wasn't
the case before.
The important part for my use case back then was that we do Merge
Join(s) at all. The test result after your patch still confirms that.
If we simply modify the test as such, we no longer confirm whether the
code path introduced in 6b94e7a6da2f1c6df1a42efe64251f32a444d174 is
still working.
Maybe it's worthwhile to add something like
create index on fract_t0 ((id*id));
EXPLAIN (COSTS OFF)
SELECT * FROM fract_t x JOIN fract_t y USING (id) ORDER BY id * id DESC
LIMIT 10;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Limit
   ->  Merge Append
         Sort Key: ((x.id * x.id)) DESC
         ->  Nested Loop
               ->  Index Scan Backward using fract_t0_expr_idx on fract_t0 x_1
               ->  Index Only Scan using fract_t0_pkey on fract_t0 y_1
                     Index Cond: (id = x_1.id)
         ->  Sort
               Sort Key: ((x_2.id * x_2.id)) DESC
               ->  Hash Join
                     Hash Cond: (x_2.id = y_2.id)
                     ->  Seq Scan on fract_t1 x_2
                     ->  Hash
                           ->  Seq Scan on fract_t1 y_2
I am not sure whether it's worth the extra test cycles on every animal,
but since we are not creating an extra table it might be ok.
I don't have a very strong feeling about the above test case.
> I will create patches for the back-branches once the patch for master
> is in a committable state.
I am not sure whether it's really a bug. I personally wouldn't be brave
enough to back-patch this. I don't want to deal with complaining end
users. Suddenly their optimizer, which always had horrible estimates,
would actually be able to do harmful stuff with them, only due to a
minor version upgrade. I think it's a bad idea to backpatch something
with complex performance implications, especially since they might even
be based on potentially inaccurate data...
>
> --
>
> Best Wishes,
> Ashutosh Bapat
All the best
Arne
On Fri, May 24, 2024 at 2:02 PM <arne.roland@malkut.net> wrote:
> I am not sure whether it's really a bug. I personally wouldn't be brave
> enough to back-patch this. I don't want to deal with complaining end
> users. Suddenly their optimizer, which always had horrible estimates,
> would actually be able to do harmful stuff with them, only due to a
> minor version upgrade. I think it's a bad idea to backpatch something
> with complex performance implications, especially since they might even
> be based on potentially inaccurate data...

+1.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi Ashutosh!

On 2024-05-27 14:17, Ashutosh Bapat wrote:
> In summary, you are suggesting that partitionwise join performs better
> than plain join even if the latter one has lower cost. Hence fixing
> this issue has never become a priority for you. Am I right?
>
> Plans with lower costs being slower is not new for the optimizer.
> Partitionwise join just adds another case.

Sorry for my confusing long text. I will try to recap my points
concisely.

1. I think the order-by-pk frac limit plans had just too similar
performance behaviour for me to bother. But afaics the main point of
your proposal is not related to frac plans at all.

2. We can't expect the optimizer to simply yield better results by
being given more options to be wrong. (Let me give a simple example:
this patch makes our lack of cross-table, cross-column statistics
worse. We give it more opportunity to pick something horrible.)

3. I dislike that this patch makes it much harder to debug why no
partitionwise join is chosen.

>> Maybe we can discuss that in person next week?
>
> Sure.

> My patch removes redundant enable_partitionwise_join = on since that's
> done very early in the test. Apart from that it does not change the
> test. So if the expected output change is fine with you, I think we
> should leave the test as is. Plan outputs are sometimes fragile and
> thus make expected outputs flaky. If at all, we can add to that.

That would indeed give us more code test coverage. I will refrain from
commenting further, since that discussion would get completely
disconnected from the patch at hand.

> Since it's a thinko I considered it as a bug. But I agree that it has
> the potential to disturb plans after upgrade and thus upset users. So
> I am fine if we don't backpatch.

All the best
Arne
On Wed, May 22, 2024 at 3:57 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> I will create patches for the back-branches once the patch for master
> is in a committable state.

AFAIU, this patch prevents apply_scanjoin_target_to_paths() from
discarding old paths of partitioned joinrels. Therefore, we can retain
non-partitionwise join paths if the cheapest path happens to be among
them.

One concern from me is that if the cheapest path of a joinrel is a
partitionwise join path, following this approach could lead to
undesirable cross-platform plan variations, as detailed in the original
comment.

Is there a specific query that demonstrates benefits from this change?
I'm curious about scenarios where a partitionwise join runs slower than
a non-partitionwise join.

Thanks
Richard
On Wed, Jul 24, 2024 at 9:42 AM Richard Guo <guofenglinux@gmail.com> wrote:
> AFAIU, this patch prevents apply_scanjoin_target_to_paths() from
> discarding old paths of partitioned joinrels. Therefore, we can retain
> non-partitionwise join paths if the cheapest path happens to be among
> them.

Right. Thanks for the summary.

> One concern from me is that if the cheapest path of a joinrel is a
> partitionwise join path, following this approach could lead to
> undesirable cross-platform plan variations, as detailed in the original
> comment.

I read through the email thread [3] referenced in the commit
(1d338584062b3e53b738f987ecb0d2b67745232a) which added that comment. The
change is mentioned in [4] first. Please notice that this change is
unrelated to the bug that started that thread. [5] and [6] talk about
the costs of a projection path above an Append vs. a projection path
below the Append. But I don't see any example of cross-platform plan
variations. I also do not see an example in that thread where such a
plan variation results in bad performance. If the costs of partitionwise
and non-partitionwise join paths are so close to each other that
platform-specific arithmetic can swing it one way or the other, possibly
their performance is going to be comparable. Without an example query
it's hard to assess this possibility or address the concern, especially
when we have examples of the behaviour otherwise.

> Is there a specific query that demonstrates benefits from this change?
> I'm curious about scenarios where a partitionwise join runs slower than
> a non-partitionwise join.

[1] provides a testcase where a non-partitionwise join is better than a
partitionwise join. This testcase is derived from a bug reported by an
EDB customer. [2] is another bug report on psql-bugs.

[1] https://www.postgresql.org/message-id/CAKZiRmyaFFvxyEYGG_hu0F-EVEcqcnveH23MULhW6UY_jwykGw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/786.1565541557%40sss.pgh.pa.us#9d50e1b375201f29bbf17072d75569e3
[3] https://www.postgresql.org/message-id/flat/15669-02fb3296cca26203%40postgresql.org
[4] https://www.postgresql.org/message-id/20477.1551819776%40sss.pgh.pa.us
[5] https://www.postgresql.org/message-id/15350.1551973953%40sss.pgh.pa.us
[6] https://www.postgresql.org/message-id/24357.1551984010%40sss.pgh.pa.us

--
Best Wishes,
Ashutosh Bapat
On 24/7/2024 15:22, Ashutosh Bapat wrote:
> On Wed, Jul 24, 2024 at 9:42 AM Richard Guo <guofenglinux@gmail.com> wrote:
>> Is there a specific query that demonstrates benefits from this change?
>> I'm curious about scenarios where a partitionwise join runs slower than
>> a non-partitionwise join.
>
> [1] provides a testcase where a non-partitionwise join is better than a
> partitionwise join. This testcase is derived from a bug reported by an
> EDB customer. [2] is another bug report on psql-bugs.

I haven't passed through the patch yet, but can this issue affect the
decision on what to push down to foreign servers: a whole join or just a
scan of two partitions?

If the patch is related to the pushdown decision, I'd say it is quite an
annoying problem for me. From time to time, I see cases where a JOIN
produces more tuples than both partitions have in total - in this case,
it would be better to transfer the tables' tuples to the main instance
before joining them.

--
regards, Andrei Lepikhov
On Tue, Oct 1, 2024 at 3:22 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
> I haven't passed through the patch yet, but can this issue affect the
> decision on what to push down to foreign servers: a whole join or just a
> scan of two partitions?
>
> If the patch is related to the pushdown decision, I'd say it is quite an
> annoying problem for me. From time to time, I see cases where a JOIN
> produces more tuples than both partitions have in total - in this case,
> it would be better to transfer the tables' tuples to the main instance
> before joining them.

Sorry for replying late. I somehow didn't notice this.

A join between partitions is pushed down only if partitionwise join is
chosen, and a join between partitions won't be pushed down if
partitionwise join is not chosen. Hence this bug affects pushdown as
well.

The CF entry shows as waiting for author. But that isn't the right
status. Will change it to needs review. I think we need a consensus as
to whether we want to fix this bug or not. Since this bug doesn't affect
me anymore, I will just withdraw this CF entry if there is no interest.

--
Best Wishes,
Ashutosh Bapat
On Thu, Jan 2, 2025 at 4:41 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> A join between partitions is pushed down only if partitionwise join is
> chosen, and a join between partitions won't be pushed down if
> partitionwise join is not chosen. Hence this bug affects pushdown as
> well.
>
> The CF entry shows as waiting for author. But that isn't the right
> status. Will change it to needs review. I think we need a consensus as
> to whether we want to fix this bug or not. Since this bug doesn't affect
> me anymore, I will just withdraw this CF entry if there is no interest.

I think it's unhelpful that you keep calling this a "bug" when the
behavior is clearly deliberate. Whether it's the *right* behavior is
debatable, but it's not *accidental* behavior.

I don't actually have a clear understanding of why we need this. In
https://www.postgresql.org/message-id/CAKZiRmyaFFvxyEYGG_hu0F-EVEcqcnveH23MULhW6UY_jwykGw%40mail.gmail.com
Jakub says that an EDB customer experienced a case where the
partitionwise plan took 530+s and the non-partitionwise plan took 23s,
but unfortunately there's no public test case, and in the examples
shared publicly, either the partitionwise plan is actually slower but is
mistakenly estimated to be faster, or the two are extremely close to the
same speed so it doesn't really matter. So the customer scenario (which
is not public) is justification for a code change, but the
publicly-posted examples, as far as I can see, are not.

And what confuses me is -- what could that test case possibly look like?
I mean, how can it be less efficient to perform N small joins than 1 big
join? For instance, suppose we're doing a merge join between A and B
(partitions A_1..A_N and B_1..B_N) and we have to sort all the data.
With a partitionwise join, we have to do 2N sorts of partitions of some
size, let's say K. The cost should be O(2N * K lg K). If we instead do
two really big sorts, the cost is now O(2 * (NK) lg (NK)), which is
more.

If we do a hash join, the cost should be about the same either way,
because probing a hash table is roughly constant time. However, if we do
N small hash joins, the hash table is a lot more likely to fit in memory
-- and if the big hash table does not fit in memory and the small hash
tables do, we should win big.

Finally, let's say we're doing a nested loop. If the inner side of the
nested loop is unparameterized, then the cost of the non-partitionwise
nested loop is O(N^2 * K^2), while the cost of the partitionwise nested
loop is O(N * K^2), which is a huge win. If the inner side is
parameterized, then the partitionwise plan involves scanning one
partition for matching values for each outer row, whereas the
non-partitionwise plan involves scanning every partition for matching
values for each outer row, which is again clearly worse.

I'm obviously missing something here, because I'm sure Jakub is quite
right when he says that this actually happened and actually hosed an EDB
customer. But I don't understand HOW it happened, and I think if we're
going to change the code we really ought to understand that and write
some code comments about it. In general, I think that it's very
reasonable to expect that a bunch of small joins will beat one big join,
which is why the code does what it currently does.

--
Robert Haas
EDB: http://www.enterprisedb.com
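For concreteness, the merge-join arithmetic in the message above can be
spelled out as a worked inequality using the same N partitions of K rows
each (an aside derived from the message, not part of it):

    2N \cdot K \log K \;<\; 2NK \log(NK) \;=\; 2N \cdot K(\log N + \log K)

so the two big non-partitionwise sorts pay an extra 2NK \log N, which is
positive for N > 1. Likewise, the unparameterized nested loop costs
(NK)^2 = N^2 K^2 without partitionwise join versus N \cdot K^2 with it.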
Robert Haas <robertmhaas@gmail.com> writes:
> I'm obviously missing something here, because I'm sure Jakub is quite
> right when he says that this actually happened and actually hosed an
> EDB customer. But I don't understand HOW it happened, and I think if
> we're going to change the code we really ought to understand that and
> write some code comments about it. In general, I think that it's very
> reasonable to expect that a bunch of small joins will beat one big
> join, which is why the code does what it currently does.
I am wondering if the problem is not that the plan is slower, it's
that for some reason the planner took a lot longer to create it.
It's very plausible that partitionwise planning takes longer, and
maybe we have some corner cases where the time is O(N^2) or worse.
However, this is pure speculation without a test case, and any
proposed fix would be even more speculative. I concur with your
bottom line: we should insist on a public test case before deciding
what to do about it.
regards, tom lane
On Thu, Jan 2, 2025 at 3:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I am wondering if the problem is not that the plan is slower, it's
> that for some reason the planner took a lot longer to create it.
> It's very plausible that partitionwise planning takes longer, and
> maybe we have some corner cases where the time is O(N^2) or worse.

That doesn't seem like a totally unreasonable speculation, but it seems
a little surprising that retaining the non-partitionwise paths would fix
it. True, that might let us discard a bunch of partitionwise paths more
quickly than would otherwise be possible, but I wouldn't expect that to
have an impact as dramatic as what Jakub alleged. The thing I thought
about was whether there might be some weird effects with lots of empty
partitions; or maybe with some other property of the path, like say sort
keys or parallelism. For example, if we couldn't generate a
partitionwise path with sort keys as good as the non-partitionwise path
had, or if we couldn't generate a parallel partitionwise path but we
could generate a parallel non-partitionwise path. As far as I knew,
neither of those things are real problems, but if they were then I
believe they could pretty easily explain a large regression.

> However, this is pure speculation without a test case, and any
> proposed fix would be even more speculative. I concur with your
> bottom line: we should insist on a public test case before deciding
> what to do about it.

Yeah.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Jan 3, 2025 at 3:02 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think it's unhelpful that you keep calling this a "bug" when the
> behavior is clearly deliberate. Whether it's the *right* behavior is
> debatable, but it's not *accidental* behavior.

Ok, let's call it "not right" behaviour :). Let me further expand on the
explanation in my first email [1].

After the planner has added all possible paths,
apply_scanjoin_target_to_paths(), which should be just adjusting their
targets, zaps them all. That looks weird. But the comment explains why
it's doing so.

-- quote comment
 * This function would still be correct if we kept the
 * existing paths: we'd modify them to generate the correct target above
 * the partitioning Append, and then they'd compete on cost with paths
 * generating the target below the Append.  However, in our current cost
 * model the latter way is always the same or cheaper cost, so modifying
 * the existing paths would just be useless work.  Moreover, when the cost
 * is the same, varying roundoff errors might sometimes allow an existing
 * path to be picked, resulting in undesirable cross-platform plan
 * variations.
--

The comment mentions only "partitioning Append" and not "Join" paths. A
simple partitioned relation's pathlist contains only Append paths, but a
partitioned join relation's pathlist contains join paths as well as
Append (of join) paths. What the comment says is true for Appends of
scans as well as Appends of joins, but not for join paths joining Append
paths - the non-partitionwise paths. In such paths we should be
adjusting only the target of the join path and not that of the paths
under the Appends. The comment does not mention a Join over Appends at
all. The code discards both join and Append paths but rebuilds only the
Append paths. There is no comment in the function explaining why we omit
joins of Append paths. So the code does not seem intentional to me as
far as partitionwise joins are concerned. Losing a costwise optimal path
doesn't seem to be something that should happen while adjusting path
targets (unless while adjusting path targets we find a lower cost path,
but we aren't keeping the old paths around for comparison).

> On Thu, Jan 2, 2025 at 3:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > However, this is pure speculation without a test case, and any
> > proposed fix would be even more speculative. I concur with your
> > bottom line: we should insist on a public test case before deciding
> > what to do about it.

That's a valid ask. AFAIR, it's a quite tricky scenario. Both Jakub and
myself have tried but could not reproduce the issue. Let me mark this CF
entry as returned with feedback and resurrect it when we have a
reproduction.

[1] https://www.postgresql.org/message-id/CAExHW5toze58+jL-454J3ty11sqJyU13Sz5rJPQZDmASwZgWiA@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
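To make the two path shapes in that explanation concrete, for a join of
partitioned tables A and B with matched partitions A_1..A_N and
B_1..B_N, the competing shapes are (illustrative notation, not from the
original message):

    Join of Appends (non-partitionwise; zapped and never rebuilt):
        Join(Append(A_1 ... A_N), Append(B_1 ... B_N))

    Append of Joins (partitionwise; the only shape rebuilt after the zap):
        Append(Join(A_1, B_1), ..., Join(A_N, B_N))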
On Thu, Jan 2, 2025 at 3:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I am wondering if the problem is not that the plan is slower, it's
> that for some reason the planner took a lot longer to create it.
> It's very plausible that partitionwise planning takes longer, and
> maybe we have some corner cases where the time is O(N^2) or worse.
>
> However, this is pure speculation without a test case, and any
> proposed fix would be even more speculative. I concur with your
> bottom line: we should insist on a public test case before deciding
> what to do about it.
I had the humbling experience of rediscovering this problem today and
then, shortly thereafter, realizing that Ashutosh's originally
explanation of the problem was correct and that I simply failed to
understand it at the time. Consider this query from the regression
tests, which Ashutosh's patch adjusts:
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a
AND t1.a = t2.b ORDER BY t1.a, t2.b;
Right now, the regression tests get this plan:
 Sort  (cost=11.52..11.53 rows=3 width=18)
   Sort Key: t1.a
   ->  Append  (cost=2.20..11.50 rows=3 width=18)
         ->  Merge Join  (cost=2.20..3.58 rows=1 width=18)
               Merge Cond: (t1_1.a = t2_1.a)
               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1  (cost=0.14..14.02 rows=125 width=9)
               ->  Sort  (cost=2.06..2.06 rows=1 width=13)
                     Sort Key: t2_1.b
                     ->  Seq Scan on prt2_p1 t2_1  (cost=0.00..2.05 rows=1 width=13)
                           Filter: (a = b)
         ->  Hash Join  (cost=2.05..4.78 rows=1 width=18)
               Hash Cond: (t1_2.a = t2_2.a)
               ->  Seq Scan on prt1_p2 t1_2  (cost=0.00..2.25 rows=125 width=9)
               ->  Hash  (cost=2.04..2.04 rows=1 width=13)
                     ->  Seq Scan on prt2_p2 t2_2  (cost=0.00..2.04 rows=1 width=13)
                           Filter: (a = b)
         ->  Hash Join  (cost=1.43..3.12 rows=1 width=18)
               Hash Cond: (t1_3.a = t2_3.a)
               ->  Seq Scan on prt1_p3 t1_3  (cost=0.00..1.50 rows=50 width=9)
               ->  Hash  (cost=1.41..1.41 rows=1 width=13)
                     ->  Seq Scan on prt2_p3 t2_3  (cost=0.00..1.41 rows=1 width=13)
                           Filter: (a = b)
But if you disable partitionwise join for this query, you get this plan:
 Merge Join  (cost=5.99..7.99 rows=3 width=18)
   Merge Cond: (t1.a = t2.a)
   ->  Merge Append  (cost=0.45..44.83 rows=300 width=9)
         Sort Key: t1.a
         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1  (cost=0.14..14.02 rows=125 width=9)
         ->  Index Scan using iprt1_p2_a on prt1_p2 t1_2  (cost=0.14..14.02 rows=125 width=9)
         ->  Index Scan using iprt1_p3_a on prt1_p3 t1_3  (cost=0.14..12.89 rows=50 width=9)
   ->  Sort  (cost=5.54..5.55 rows=3 width=13)
         Sort Key: t2.b
         ->  Append  (cost=0.00..5.51 rows=3 width=13)
               ->  Seq Scan on prt2_p1 t2_1  (cost=0.00..2.05 rows=1 width=13)
                     Filter: (a = b)
               ->  Seq Scan on prt2_p2 t2_2  (cost=0.00..2.04 rows=1 width=13)
                     Filter: (a = b)
               ->  Seq Scan on prt2_p3 t2_3  (cost=0.00..1.41 rows=1 width=13)
                     Filter: (a = b)
This demonstrates that, in this case, setting
enable_partitionwise_join = on causes the planner to switch to a plan
with a higher total cost, which obviously shouldn't happen. It's still
unclear to me how this can ever cause a slowdown of the magnitude that
is reported to have happened in an EDB customer scenario, but I think
this example is sufficient to prove that the current behavior is
fundamentally incorrect. The issue is that when
apply_scanjoin_target_to_paths() sets the pathlist to NIL, it does so
on the assumption that all the same paths will be re-added afterwards,
except with the correct scan/join target. But for partitionwise join
relations, this is not true: any Append and MergeAppend paths will be
re-added afterwards, but any non-partitionwise joins are only added
before the pathlist is zapped, and therefore the planner loses the
ability to consider them when the pathlist is reset. Said differently,
the existing comment would be correct if *all* of the paths for the
rel had to be Append/MergeAppend paths, which is true for a
partitioned baserel (I think) but untrue for a partitioned joinrel
(for sure).
I haven't had a chance just yet to think through all the details of
the proposed patch, but I now believe we should commit something along
those lines. I still suspect that back-patching is unwise; even though
I now agree with Ashutosh's claim that this is a bug, because previous
experience with destabilizing plans in back-branches has not been
good. Hence, I'm inclined to fix only master. I do think the comments
in the patch need some work, and I plan to tackle that tomorrow.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Oct 27, 2025 at 5:12 PM Robert Haas <robertmhaas@gmail.com> wrote: > I haven't had a chance just yet to think through all the details of > the proposed patch, but I now believe we should commit something along > those lines. I still suspect that back-patching is unwise; even though > I now agree with Ashutosh's claim that this is a bug, because previous > experience with destabilizing plans in back-branches has not been > good. Hence, I'm inclined to fix only master. I do think the comments > in the patch need some work, and I plan to tackle that tomorrow. It seems that, in the time sense this patch was originally posted, it's been side-swiped by Richard Guo's commits 24225ad9aafc and 9b282a9359a1, with the result that the regression tests now fail with the patch applied, and I'm not immediately certain how to clean that up. I'm also not sure that the way the patch handles the test cases it did adjust is optimal. Here is some preliminary analysis; opinions appreciated. With the patch as last posted applied, I see three regression test failures. The first one is for this query: explain (verbose, costs off) select * from unique_tbl_p t1, unique_tbl_p t2 where (t1.a, t2.a) in (select a, a from unique_tbl_p t3) order by t1.a, t2.a; The second is for this query: EXPLAIN (COSTS OFF) SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b; And the third one is for this query: EXPLAIN (COSTS OFF) SELECT t1.* FROM prt1 t1 WHERE t1.a IN (SELECT t1.b FROM prt2 t1 WHERE t1.b IN (SELECT (t1.a + t1.b)/2 FROM prt1_e t1 WHERE t1.c = 0)) AND t1.b = 0 ORDER BY t1.a; Without the patch, all of the queries perform full partitionwise joins, between three tables in the first and third cases and between two tables (since that's all there are) in the second case. With the patch, the first and third cases switch to performing one of the joins partitionwise and one of the joins non-partitionwise, and the second case switches to performing a non-partitionwise join. I think this defeats the point of the test cases, so it probably needs to be fixed somehow, but I'm not sure what kind of adjustment would be most appropriate. I tried adjusting these three queries to use COSTS ON, and compared the costs with and without the fix: Q1 fixed: Merge Join (cost=9.98..1417.93 rows=83503 width=16) Q1 unfixed: Merge Append (cost=10.01..2283.64 rows=83503 width=16) Q2 fixed: Merge Join (cost=5.96..7.86 rows=3 width=18) Q2 unfixed: Sort (cost=11.52..11.53 rows=3 width=18) Q3 fixed: Merge Join (cost=25.09..27.47 rows=1 width=13) Q3 unfixed: Merge Append (cost=24.94..27.27 rows=3 width=13) What we see here is that, in the case of Q1, the fix reduces the cost by a large amount, which is the kind of thing you'd hope would happen when you fix a costing a bug, although I haven't quite figured out why we get such a large benefit. Likewise, in the second case, the cost goes down with the fix, although not by a lot. That case is interesting because the plan selected with the patch is a merge join of appends of index scans, which of every possible plan shape is probably the one that benefits least from being performed partitionwise. If the merge join involved a sort, you'd expect the partitionwise approach to win, since several smaller sorts figure to cost less in total than one big sort; but here it doesn't, so there's little room for the partitionwise nature of the operation to provide a benefit, and apparently the planner thinks that, in fact, it doesn't. 
But the Q3 change is really the most disturbing part of this -- the cost actually goes up with the fix. I haven't figured out whether that's due to some kind of round-off error or whether it's evidence that the patch doesn't properly fix the bug. I wonder whether Richard's rewrite of unique-ification requires some adjustment to the patch. Another test case failure that would have happened had Ashutosh not compensated for it in the patch is with this query: EXPLAIN (COSTS OFF) SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id ASC LIMIT 10; Now, currently, this chooses a partitionwise-join. The Limit node at the top of the plan tree has a cost of 2.22 and the underlying Merge Append node has a cost of 223.11. But if you apply the fix and not Ashutosh's adjustments to the test case, then you get a non-partitionwise join, where the Limit at the top of the plan tree has a cost of 2.21 and the underlying Merge Append node has a cost of 223.10. Since we're just trying to decide whether to Append some Merge Joins or Merge Join some Appends, it's not surprising that the costs are very close. However, I also find it slightly discouraging in terms of accepting Ashutosh's proposed fix. In the Q1 case, above, we apparently reduce the cost specifically by not flushing the path list. But here, we just end up picking a nearly equivalent path with a nearly-equivalent cost. At least, that means the test case isn't likely to be stable, and we could just patch around that, as Ashutosh did, by suppressing partitionwise join (it is not clear whether this compromises the goals of the test case, but it's not obvious that it does). But it might also be taken as a worrying indication that plans of this form are going to come out as either partitionwise or not based on essentially random factors, which could be viewed as a flaw in the approach. I'm not really sure which way to view it, and if is a flaw in the approach, then I'm not sure what to do instead. Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Oct 29, 2025 at 5:17 AM Robert Haas <robertmhaas@gmail.com> wrote: > What we see here is that, in the case of Q1, the fix reduces the cost > by a large amount, which is the kind of thing you'd hope would happen > when you fix a costing a bug, although I haven't quite figured out why > we get such a large benefit. Likewise, in the second case, the cost > goes down with the fix, although not by a lot. That case is > interesting because the plan selected with the patch is a merge join > of appends of index scans, which of every possible plan shape is > probably the one that benefits least from being performed > partitionwise. If the merge join involved a sort, you'd expect the > partitionwise approach to win, since several smaller sorts figure to > cost less in total than one big sort; but here it doesn't, so there's > little room for the partitionwise nature of the operation to provide a > benefit, and apparently the planner thinks that, in fact, it doesn't. > But the Q3 change is really the most disturbing part of this -- the > cost actually goes up with the fix. I haven't figured out whether > that's due to some kind of round-off error or whether it's evidence > that the patch doesn't properly fix the bug. I wonder whether > Richard's rewrite of unique-ification requires some adjustment to the > patch. I don't think the rewrite of unique-ification requires any adjustment to this patch. I ran Q1 on v18, which does not include the unique-ification changes, and here is what I observed: without Ashutosh's patch, it performs a full partitionwise join; with the patch, it performs one join partitionwise and the other non-partitionwise. The costs of the unpatched versus patched versions on v18 are 2286.11 and 1420.40, respectively, indicating that Ashutosh's patch reduces the cost by a large amount. This matches your observation exactly. I think this suggests that we can rule out the interference from the unique-ification changes. The comment explaining why apply_scanjoin_target_to_paths() throws away all existing paths claims that: * If the rel is partitioned, we want to drop its existing paths and * generate new ones. This function would still be correct if we kept the * existing paths: we'd modify them to generate the correct target above * the partitioning Append, and then they'd compete on cost with paths * generating the target below the Append. However, in our current cost * model the latter way is always the same or cheaper cost, so modifying * the existing paths would just be useless work. However, that reasoning is valid only when all of the existing paths are Appends of Scans or Joins. It does not hold for a partitioned join relation, which can have paths that are Joins of Appends. Therefore, I think there's something wrong with the current logic, and we may need to do something about it. IIUC, Ashutosh's patch avoids discarding existing paths for partitioned join relations, so that we can retain non-partitionwise paths and ensure we don't miss the cheapest path if it happens to be among them. One of my concerns with this approach is that, for a partitionwise join path in the existing paths, we would end up with two paths after apply_scanjoin_target_to_paths(): one with the scan/join target applied above the Append, and one below it. As the aforementioned comment explains, these two paths tend to have the same cost, resulting in redundant work and potentially causing cross-platform plan variations. 
Maybe we could address this by discarding all existing partitionwise paths and relying on apply_scanjoin_target_to_paths() to rebuild these Append paths after applying the target to all partitions? Another concern I have, though I'm not entirely sure, is about comparing the costs between a partitionwise join path and a non-partitionwise join path. It seems to me that their costs are computed in very different ways, so I'm not sure whether the costs are truly comparable. So I suspect that, with the patch, there may be cases where a lower estimated cost does not necessarily translate to shorter execution time. However, I'm not sure what to do about this. - Richard
Hi Robert,
Richard already covered a lot. I mainly want to reiterate, that a public test case would be immensely helpful.
On Mon, Oct 27, 2025 at 5:12 PM Robert Haas <robertmhaas@gmail.com> wrote:I haven't had a chance just yet to think through all the details of the proposed patch, but I now believe we should commit something along those lines. I still suspect that back-patching is unwise; even though I now agree with Ashutosh's claim that this is a bug, because previous experience with destabilizing plans in back-branches has not been good. Hence, I'm inclined to fix only master. I do think the comments in the patch need some work, and I plan to tackle that tomorrow.It seems that, in the time sense this patch was originally posted, it's been side-swiped by Richard Guo's commits 24225ad9aafc and 9b282a9359a1, with the result that the regression tests now fail with the patch applied, and I'm not immediately certain how to clean that up. I'm also not sure that the way the patch handles the test cases it did adjust is optimal. Here is some preliminary analysis; opinions appreciated. With the patch as last posted applied, I see three regression test failures. The first one is for this query: explain (verbose, costs off) select * from unique_tbl_p t1, unique_tbl_p t2 where (t1.a, t2.a) in (select a, a from unique_tbl_p t3) order by t1.a, t2.a;
You earlier requested a case, where we can in fact measure an advantage of the new plan. I think we won't be able to get rid of the disadvantages. You said yourself beautifully:
I don't actually have a clear understanding of why we need this. In https://www.postgresql.org/message-id/CAKZiRmyaFFvxyEYGG_hu0F-EVEcqcnveH23MULhW6UY_jwykGw%40mail.gmail.com Jakub says that an EDB customer experienced a case where the partitionwise plan took 530+s and the non-partitionwise plan took 23s, but unfortunately there's no public test case, and in the examples shared publicly, either the partionwise plan is actually slower but is mistakenly estimated to be faster, or the two are extremely close to the same speed so it doesn't really matter. So the customer scenario (which is not public) is justification for a code-change, but the publicly-posted examples, as far as I can see, are not.
The Q1 you mentioned sadly isn't a real test case, where I can measure performance impact. More an academic difference in costs, which I don't fully comprehend as of now.
On 2025-10-28 21:17, Robert Haas wrote:
While this is probably a common occurrence, the use of CPU cycles is close enough, that I suspect this wouldn't be a massive issue. The main problem I see between these two very similar plans seem to me the potentially very different memory footprint. The work_mem spill file is still independent per worker node. With the patch, I can easily see a world, where that never becomes a problem on some development database, but on the nearly identical live database. This behavior seems incredibly hard to test for.[...]In the Q1 case, above, we apparently reduce the cost specifically by not flushing the path list. But here, we just end up picking a nearly equivalent path with a nearly-equivalent cost. At least, that means the test case isn't likely to be stable, and we could just patch around that, as Ashutosh did, by suppressing partitionwise join (it is not clear whether this compromises the goals of the test case, but it's not obvious that it does). But it might also be taken as a worrying indication that plans of this form are going to come out as either partitionwise or not based on essentially random factors, which could be viewed as a flaw in the approach. I'm not really sure which way to view it, and if is a flaw in the approach, then I'm not sure what to do instead.
Thoughts?
Did you encounter a case a in production, that made you reevaluate this thread? If so a public reproducer would be very appreciated.
Regards
Arne
On Wed, Oct 29, 2025 at 5:21 AM Richard Guo <guofenglinux@gmail.com> wrote: > I don't think the rewrite of unique-ification requires any adjustment > to this patch. I ran Q1 on v18, which does not include the > unique-ification changes, and here is what I observed: without > Ashutosh's patch, it performs a full partitionwise join; with the > patch, it performs one join partitionwise and the other > non-partitionwise. The costs of the unpatched versus patched versions > on v18 are 2286.11 and 1420.40, respectively, indicating that > Ashutosh's patch reduces the cost by a large amount. This matches > your observation exactly. I think this suggests that we can rule out > the interference from the unique-ification changes. This testing methodology makes some sense to me, but it seems here you have tested Q1 here, which was the good case, rather than Q3, which was the troubling one. > However, that reasoning is valid only when all of the existing paths > are Appends of Scans or Joins. It does not hold for a partitioned > join relation, which can have paths that are Joins of Appends. > Therefore, I think there's something wrong with the current logic, and > we may need to do something about it. I agree. > Maybe we could address this by discarding all existing partitionwise > paths and relying on apply_scanjoin_target_to_paths() to rebuild these > Append paths after applying the target to all partitions? I'm quite afraid of just deleting items out of the pathlist, because the pathlist has to satisfy a complicated set of invariants and it is unclear to me that we wouldn't end up violating some of them. I think we could mitigate the danger if we re-added the old paths we want to keep. That is, set some variable to the pathlist, set the pathlist to NIL, and then iterate over the saved pathlist and call add_path() on each path we wish to keep. That would produce effects similar to modeling the final scan/join target as a separate RelOptInfo, which is what this code is trying to do without actually creating a separate RelOptInfo. Maybe that's overly cautious and would always produce the same results as deleting from the path list, but I'm not sure. Deleting from the pathlist is hypothetically allowed (c.f. comments for set_join_pathlist_hook) but I wonder whether anyone has actually been able to make it work in real life. > Another concern I have, though I'm not entirely sure, is about > comparing the costs between a partitionwise join path and a > non-partitionwise join path. It seems to me that their costs are > computed in very different ways, so I'm not sure whether the costs are > truly comparable. So I suspect that, with the patch, there may be > cases where a lower estimated cost does not necessarily translate to > shorter execution time. However, I'm not sure what to do about this. I think that's a general risk of using a cost model to choose plans, and in that sense I believe it is something we neither can nor should try to fix, here or in general. What I find more concerning about this case than, in certain cases, the costs of a partititionwise path and a non-partitionwise path are extremely close together for very understandable reasons. Hence, the choice will be unpredictable, which I fear might create problems for users. I'm not sure what to do about, though. 
It's tempting to think that we don't need to consider both a MergeJoin-of-Appends and an Append-of-MergeJoins, which seems to be the case where you get almost-identical costs, but that's actually only true when there's no sorting happening under the merge-joins. Even if that were no issue, it's unclear how we could reasonably avoid considering both of those possibilities given the code structure. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Oct 29, 2025 at 6:43 AM Arne Roland <arne.roland@malkut.net> wrote: > Richard already covered a lot. I mainly want to reiterate, that a public test case would be immensely helpful. I agree, and said the same. > The Q1 you mentioned sadly isn't a real test case, where I can measure performance impact. More an academic differencein costs, which I don't fully comprehend as of now. I don't either, but maybe if I study it (or you or someone else does) we can begin to comprehend it. > Did you encounter a case a in production, that made you reevaluate this thread? If so a public reproducer would be veryappreciated. No, what happened is that this broke a patch I'm working on. The details are lengthy and would take us too far away from the topic of this thread, but the highly-compressed version is that I spent about six hours going "wait, why the heck isn't this working?" and eventually traced it back to the pathlist for a partitionwise join getting zapped. I might have to bolt on some kind of a fix to un-break that patch for now, but it's not relevant in terms of constructing a reproducer for this problem. I think the best shot at coming up with a reproducer here is to study the cost differences in the queries where the plan changes with the fix, particularly Q1 from my prior email. While I agree with you that at present that is just a numerical effect and not a real performance effect, we don't even have an explanation for how the numerical effect is possible. It seems like a good idea to figure that out. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Oct 29, 2025 at 8:47 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think the best shot at coming up with a
> reproducer here is to study the cost differences in the queries where
> the plan changes with the fix, particularly Q1 from my prior email.
So, the really interesting thing about Q1 is that it contains a join
which inflates the row count by a factor of about 84. We first join
t1, which has 1001 rows, to t3, which has 1001 rows, and get 1001
rows. Then we join to t2, which also has 1001 rows, and we get 83503
rows. It is estimated to be mildly more efficient to perform the t1-t3
join partition-wise: it costs 98.69 cost units to do it partitionwise
and 104.56 cost units to do it non-partitionwise. However, the planner
believes that doing the subsequent join to t2 in partitionwise fashion
is a bad idea. The query has an ORDER BY clause, which means that
after we finish doing the partitionwise part of the operation, we need
to perform a Merge Append to restore the sort order. If we do the
Merge Append before joining to t2, we only have to Merge Append 1001
rows, but if we do the Merge Append after joining to t2, we have to
Merge Append 83530 rows. The estimated cost of Merge Append is
proportional to the number of input rows, so doing that MergeAppend
after the join to t2 is estimated to be 84 times as expensive. In some
situations, we might make up for that loss by being able to do the
join more efficiently, but here the planner does not believe that to
be the case case: we can do the join to t2 by a Merge Join over an
Append over one index scan per partition, and there's basically no
overhead vs. a partitionwise join. Hence, from the planner's point of
view, doing the join to t2 in partitionwise fashion is a significant
loss.
I had difficulty reproducing this theoretical performance regression
in practice. I found that doing the whole thing partitionwise, doing
the whole thing non-partitionwise, and doing only the t1-t3 join
partitionwise weren't that different in runtime, and the partitionwise
approach actually seemed to be a little faster. But I constructed the
following example, similar to but simpler than the one in the
regression tests, which does show a regression for me in practice:
drop table if exists dupfest;
create table dupfest (a text) partition by range(a);
create table dupfest1 partition of dupfest for values from ('0') to ('3');
create table dupfest2 partition of dupfest for values from ('3') to ('6');
create table dupfest3 partition of dupfest for values from ('6') to ('A');
insert into dupfest
select '0' from generate_series(0, 10000) i
union all
select '3' from generate_series(0, 10000) i
union all
select '6' from generate_series(0, 10000) i;
create index on dupfest(a);
analyze dupfest;
set max_parallel_workers_per_gather = 0;
My test query was:
select count(*) from (select * from dupfest t1, dupfest t2 where t1.a
= t2.a order by t1.a offset 0);
I want to start by saying that I haven't tried super-hard to do
rigorous benchmarking. This is a debug-enabled, assertion-enabled
build with patched source code. Results may not be fully
representative. But here's what I found. EXPLAIN ANALYZE of this query
without a partitionwise join took 30.8 seconds; without EXPLAIN
ANALYZE, it ran in 15.4 seconds. With partitionwise join, EXPLAIN
ANALYZE of the query ran in 89.6 seconds; without EXPLAIN ANALYZE, it
ran in 64.5 seconds. The plan without partitionwise join was a merge
join with an append of index only scans on both sides and a
materialize node on the inner side. With partitionwise join, it
switched to Nested Loop plans with index-only scans on the outer side
and a materialize node over a sequential scan on the inner side,
followed by a Merge Append.
A notable point here is that the joins take about the same amount of
time in both plans. In the EXPLAIN ANALYZE output, we see the three
joins in the partitionwise plan taking a total of 24.6 seconds, and
the single join in the non-partitionwise plan taking 24 seconds
(exclusive of times for child nodes). However, the two Append nodes in
the non-partitionwise plan run for a total of 2.5 *milliseconds* while
the single Merge Append node in the partitionwise plan runs for 58.2
seconds (again, exclusive of times for child nodes). Obviously,
EXPLAIN ANALYZE distorts the runtime a lot, but I think the overall
point is nonetheless fairly clear: running a lot of tuples through a
Merge Append node is potentially expensive, and it can be worth
eschewing a partitionwise join to avoid that.
I also tried running the same test without the "order by t1.a". With
that change EXPLAIN ANALYZE took 24.3 seconds without partitionwise
join and 34.8 seconds with partitionwise join. The times without
EXPLAIN ANALYZE were quite close, around 15 seconds either way, but it
looks to me as though the partitionwise plan was probably still a bit
worse. What I think is happening here is that even running a large
number of tuples through Append can have enough overhead to matter in
extreme cases, but EXPLAIN ANALYZE significantly increases the cost of
entering and exiting nodes, so in that case the difference is much
easier to measure.
I don't know whether the EDB customer problem that started this thread
was of the same type demonstrated here or not. It may well have been
something else. However, unless I've fouled up the test case shown
above in some way, which is not impossible, this does demonstrate that
it is possible, at least in corner cases, to run into scenarios where
a partitionwise join is worse than a non-partitionwise join. In this
example, the reason it's worse is because postponing the MergeAppend
until a later stage results in the MergeAppend seeing a much larger
number of rows.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2025-10-29 17:47, Robert Haas wrote:
> On Wed, Oct 29, 2025 at 8:47 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> I think the best shot at coming up with a
>> reproducer here is to study the cost differences in the queries where
>> the plan changes with the fix, particularly Q1 from my prior email.
> So, the really interesting thing about Q1 is that it contains a join
> which inflates the row count by a factor of about 84. We first join
> t1, which has 1001 rows, to t3, which has 1001 rows, and get 1001
> rows. Then we join to t2, which also has 1001 rows, and we get 83503
> rows. It is estimated to be mildly more efficient to perform the t1-t3
> join partition-wise: it costs 98.69 cost units to do it partitionwise
> and 104.56 cost units to do it non-partitionwise. However, the planner
> believes that doing the subsequent join to t2 in partitionwise fashion
> is a bad idea. The query has an ORDER BY clause, which means that
> after we finish doing the partitionwise part of the operation, we need
> to perform a Merge Append to restore the sort order. If we do the
> Merge Append before joining to t2, we only have to Merge Append 1001
> rows, but if we do the Merge Append after joining to t2, we have to
> Merge Append 83530 rows. The estimated cost of Merge Append is
> proportional to the number of input rows, so doing that MergeAppend
> after the join to t2 is estimated to be 84 times as expensive. In some
> situations, we might make up for that loss by being able to do the
> join more efficiently, but here the planner does not believe that to
> be the case case: we can do the join to t2 by a Merge Join over an
> Append over one index scan per partition, and there's basically no
> overhead vs. a partitionwise join. Hence, from the planner's point of
> view, doing the join to t2 in partitionwise fashion is a significant
> loss.
>
> I had difficulty reproducing this theoretical performance regression
> in practice. I found that doing the whole thing partitionwise, doing
> the whole thing non-partitionwise, and doing only the t1-t3 join
> partitionwise weren't that different in runtime, and the partitionwise
> approach actually seemed to be a little faster. But I constructed the
> following example, similar to but simpler than the one in the
> regression tests, which does show a regression for me in practice:
>
> drop table if exists dupfest;
> create table dupfest (a text) partition by range(a);
> create table dupfest1 partition of dupfest for values from ('0') to ('3');
> create table dupfest2 partition of dupfest for values from ('3') to ('6');
> create table dupfest3 partition of dupfest for values from ('6') to ('A');
> insert into dupfest
> select '0' from generate_series(0, 10000) i
> union all
> select '3' from generate_series(0, 10000) i
> union all
> select '6' from generate_series(0, 10000) i;
> create index on dupfest(a);
> analyze dupfest;
> set max_parallel_workers_per_gather = 0;
>
> My test query was:
>
> select count(*) from (select * from dupfest t1, dupfest t2 where t1.a
> = t2.a order by t1.a offset 0);
Thank you, that is very helpful!
> I want to start by saying that I haven't tried super-hard to do
> rigorous benchmarking. This is a debug-enabled, assertion-enabled
> build with patched source code. Results may not be fully
> representative. But here's what I found. EXPLAIN ANALYZE of this query
> without a partitionwise join took 30.8 seconds; without EXPLAIN
> ANALYZE, it ran in 15.4 seconds. With partitionwise join, EXPLAIN
> ANALYZE of the query ran in 89.6 seconds; without EXPLAIN ANALYZE, it
> ran in 64.5 seconds. The plan without partitionwise join was a merge
> join with an append of index only scans on both sides and a
> materialize node on the inner side. With partitionwise join, it
> switched to Nested Loop plans with index-only scans on the outer side
> and a materialize node over a sequential scan on the inner side,
> followed by a Merge Append.
The margins are slightly lower at my end, but I can clearly reproduce
this locally.
> A notable point here is that the joins take about the same amount of
> time in both plans. In the EXPLAIN ANALYZE output, we see the three
> joins in the partitionwise plan taking a total of 24.6 seconds, and
> the single join in the non-partitionwise plan taking 24 seconds
> (exclusive of times for child nodes). However, the two Append nodes in
> the non-partitionwise plan run for a total of 2.5 *milliseconds* while
> the single Merge Append node in the partitionwise plan runs for 58.2
> seconds (again, exclusive of times for child nodes). Obviously,
> EXPLAIN ANALYZE distorts the runtime a lot, but I think the overall
> point is nonetheless fairly clear: running a lot of tuples through a
> Merge Append node is potentially expensive, and it can be worth
> eschewing a partitionwise join to avoid that.
Can you help me, why we even need a Merge Append node in the partition
wise case? The sorted streams we have are sorted by the key column of
the partitioning and hence a simple Append node should suffice anyways, no?
Either way running less trough any node can make stuff significantly
faster. This case happens, if we join on something, which includes the
partition key, but is not unique. I failed to consider that. My
customers almost always joined on pk in these partition wise cases.
> I also tried running the same test without the "order by t1.a". With
> that change EXPLAIN ANALYZE took 24.3 seconds without partitionwise
> join and 34.8 seconds with partitionwise join. The times without
> EXPLAIN ANALYZE were quite close, around 15 seconds either way, but it
> looks to me as though the partitionwise plan was probably still a bit
> worse. What I think is happening here is that even running a large
> number of tuples through Append can have enough overhead to matter in
> extreme cases, but EXPLAIN ANALYZE significantly increases the cost of
> entering and exiting nodes, so in that case the difference is much
> easier to measure.
While I see a massive degradation for the explain analyze case, without
it in the unsorted case the partition wise variant runs consistently 8-9
% faster on my local machine. The explain reveals to me a quicker Hash
Join. (I suspect due to quicker hash map lookups.) I generally believe
taking the explain analyze timings here are a bit misleading, because
they add very disproportional overhead.
> I don't know whether the EDB customer problem that started this thread
> was of the same type demonstrated here or not. It may well have been
> something else. However, unless I've fouled up the test case shown
> above in some way, which is not impossible, this does demonstrate that
> it is possible, at least in corner cases, to run into scenarios where
> a partitionwise join is worse than a non-partitionwise join. In this
> example, the reason it's worse is because postponing the MergeAppend
> until a later stage results in the MergeAppend seeing a much larger
> number of rows.
Either way: Despite me not encountering this in the wild, your sorted
case doesn't seem too exotic to me and demonstrates the need to do
something.
This virtually equivalent query issue occurs when the join condition is
(almost) unique. The different amount of tuples to process clearly
occurs when they are not.
Part of me wonders whether it could be worthwhile to do something
differently in the case, that the join key is unique. I don't like to
introduce another level of complexity, but it's interesting how close
the costs without the (Merge) Append nodes are. We again have less than
0.0018 % difference on my end, which is essentially asking for problems
across platforms.
Regards
Arne
On Wed, Oct 29, 2025 at 8:36 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Oct 29, 2025 at 5:21 AM Richard Guo <guofenglinux@gmail.com> wrote: > > I don't think the rewrite of unique-ification requires any adjustment > > to this patch. I ran Q1 on v18, which does not include the > > unique-ification changes, and here is what I observed: without > > Ashutosh's patch, it performs a full partitionwise join; with the > > patch, it performs one join partitionwise and the other > > non-partitionwise. The costs of the unpatched versus patched versions > > on v18 are 2286.11 and 1420.40, respectively, indicating that > > Ashutosh's patch reduces the cost by a large amount. This matches > > your observation exactly. I think this suggests that we can rule out > > the interference from the unique-ification changes. > > This testing methodology makes some sense to me, but it seems here you > have tested Q1 here, which was the good case, rather than Q3, which > was the troubling one. That said, after some investigation, I believe your conclusion to be correct. What seems to be happening with Q3 is that the higher-cost path (27.47, one partitionwise join) is added before the absolute cheapest path (27.27, two partitionwise joins) and that's not enough difference for compare_path_costs_fuzzily with STD_FUZZ_FACTOR to prefer the second one over the first. If I raise STD_FUZZ_FACTOR from 0.5 to 5, to make the costs of the two paths more different, then this behavior vanishes. So I agree that this seems to have nothing to do with your work; it appears that your test cases just got swept up in it incidentally. Do you have any thoughts on how we might adjust these test cases? -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Oct 29, 2025 at 2:06 PM Arne Roland <arne.roland@malkut.net> wrote: > This virtually equivalent query issue occurs when the join condition is > (almost) unique. The different amount of tuples to process clearly > occurs when they are not. I'm having trouble interpreting this. If it's important, please clarify and show an example. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Oct 29, 2025 at 2:06 PM Arne Roland <arne.roland@malkut.net> wrote:This virtually equivalent query issue occurs when the join condition is (almost) unique. The different amount of tuples to process clearly occurs when they are not.I'm having trouble interpreting this. If it's important, please clarify and show an example.
Thank you for asking. I hope my explanations are clear. If not I am happy to explain a particular thing in more detail.
The main factor of your example is, that the amount of rows handled by the (Merge) Append is different.
In the partition wise join we process a lot of rows, namely 300060003:
Aggregate (cost=12130962.32..12130962.33 rows=1 width=8)
-> Merge Append (cost=0.88..8380212.28 rows=300060003 width=34)
Sort Key: t1.a
-> Nested Loop (cost=0.29..1500664.33 rows=100020001 width=34)
Join Filter: (t1_1.a = t2_1.a)
-> Index Only Scan using dupfest1_a_idx on dupfest1 t1_1 (cost=0.29..194.30 rows=10001 width=2)
-> Materialize (cost=0.00..195.01 rows=10001 width=2)
-> Seq Scan on dupfest1 t2_1 (cost=0.00..145.01 rows=10001 width=2)
-> Nested Loop (cost=0.29..1500664.33 rows=100020001 width=34)
Join Filter: (t1_2.a = t2_2.a)
-> Index Only Scan using dupfest2_a_idx on dupfest2 t1_2 (cost=0.29..194.30 rows=10001 width=2)
-> Materialize (cost=0.00..195.01 rows=10001 width=2)
-> Seq Scan on dupfest2 t2_2 (cost=0.00..145.01 rows=10001 width=2)
-> Nested Loop (cost=0.29..1500664.33 rows=100020001 width=34)
Join Filter: (t1_3.a = t2_3.a)
-> Index Only Scan using dupfest3_a_idx on dupfest3 t1_3 (cost=0.29..194.30 rows=10001 width=2)
-> Materialize (cost=0.00..195.01 rows=10001 width=2)
-> Seq Scan on dupfest3 t2_3 (cost=0.00..145.01 rows=10001 width=2)
In the non partitioned case we have less rows, because we have *more* rows after joining the two relations, because the join has more rows, than either of the partitioned tables had before. The Append only processes 30003 rows.
Aggregate (cost=8253191.53..8253191.54 rows=1 width=8) (actual time=64208.334..64208.337 rows=1 loops=1)
-> Merge Join (cost=1.71..4502441.21 rows=300060025 width=34) (actual time=28.900..51731.558 rows=300060003 loops=1)
Merge Cond: (t1.a = t2.a)
-> Append (cost=0.86..732.91 rows=30003 width=2) (actual time=0.036..7.044 rows=30003 loops=1)
-> Index Only Scan using dupfest1_a_idx on dupfest1 t1_1 (cost=0.29..194.30 rows=10001 width=2) (actual time=0.034..1.436 rows=10001 loops=1)
-> Index Only Scan using dupfest2_a_idx on dupfest2 t1_2 (cost=0.29..194.30 rows=10001 width=2) (actual time=0.011..1.513 rows=10001 loops=1)
-> Index Only Scan using dupfest3_a_idx on dupfest3 t1_3 (cost=0.29..194.30 rows=10001 width=2) (actual time=0.011..1.408 rows=10001 loops=1)
-> Materialize (cost=0.86..807.92 rows=30003 width=2) (actual time=0.014..13787.902 rows=300050003 loops=1)
-> Append (cost=0.86..732.91 rows=30003 width=2) (actual time=0.011..4.225 rows=30003 loops=1)
-> Index Only Scan using dupfest1_a_idx on dupfest1 t2_1 (cost=0.29..194.30 rows=10001 width=2) (actual time=0.010..0.698 rows=10001 loops=1)
-> Index Only Scan using dupfest2_a_idx on dupfest2 t2_2 (cost=0.29..194.30 rows=10001 width=2) (actual time=0.005..0.708 rows=10001 loops=1)
-> Index Only Scan using dupfest3_a_idx on dupfest3 t2_3 (cost=0.29..194.30 rows=10001 width=2) (actual time=0.006..0.776 rows=10001 loops=1)
A very common case I have seen for partitionwise joins is some foreign key structure. There are several design paradigms, that make it common. If you join across some foreign key structure, the amount of tuples doesn't increase. Think the table definition of
create table dupfest (a text, id bigint not null generated always as identity, primary key (a, id), foreign key (a, id) references dupfest (a, id)) partition by range(a);
With the query
select count(*) from (select * from dupfest t1, dupfest t2 where t1.a = t2.a and t1.id = t2.id order by t1.a offset 0);
If we join alongside a foreign key from t1 to t2, we know that the join can't contain more tuples than t1 did. This may seem like a very special case, but it's as far as enable_partitionwise_join = true is concerned definitely a common use case.
If we remove the order condition these cases also have very similar performance behavior (The difference in time is less than the noise threshold of my benchmark (regardless of the amount of data as long as work_mem is sufficiently large).):
select count(*) from (select * from dupfest t1, dupfest t2 where t1.a
= t2.a and t1.id = t2.id offset 0);
If we enter 10 times the data, we see a only marginal difference in the cost. (And similar performance differences, unless I do the explain analyze, which ruins the time readings again.)
My second sentence just captured the mundane observation, if the join has significantly more tuples, than any base relation, the place of the (Merge) Append might be more relevant. If I join everything with a generate_series(1, 30000) I get more tuples to process.
I'd like to make one more side note about this example: The planner punishes the partitionwise join for having an extra node, that emits N rows (three Hash joins + Append vs two Appends + Hash Join). This plan is chosen because of the cpu_tuple_cost. I'm happy it picks the plan with the smaller memory footprint, but in my real world experience for a timing based approach the default cpu_tuple_cost tends to be too high to get a fair comparison between partitionwise and non partitionwise joins.
All the best
Arne
On Wed, Oct 29, 2025 at 9:23 PM Arne Roland <arne.roland@malkut.net> wrote: > The main factor of your example is, that the amount of rows handled by the (Merge) Append is different. Right. Although that's the main thing here, I am inclined to suspect there are other ways to hit this problem, maybe ways that are more likely to happen in the real world, because... > My second sentence just captured the mundane observation, if the join has significantly more tuples, than any base relation,the place of the (Merge) Append might be more relevant. If I join everything with a generate_series(1, 30000) Iget more tuples to process. ...as you imply, joins that inflate the row count are somewhat uncommon. They definitely do happen, but they're not the most typical pattern, and there might well be other reasons why a partitionwise join fails to win that we haven't figured out yet. These could even be cases where, for example, a certain optimization that works in the non-partitionwise case is not preserved in the partitionwise case. I feel like I now understand *one* case where Ashutosh's patch can make a demonstrable positive difference, but whether that's the only case that exists seems quite uncertain. > I'd like to make one more side note about this example: The planner punishes the partitionwise join for having an extranode, that emits N rows (three Hash joins + Append vs two Appends + Hash Join). This plan is chosen because of the cpu_tuple_cost.I'm happy it picks the plan with the smaller memory footprint, but in my real world experience for a timingbased approach the default cpu_tuple_cost tends to be too high to get a fair comparison between partitionwise and nonpartitionwise joins. Have you localized the problem to cpu_tuple_cost specifically, vs. cpu_index_tuple_cost or cpu_operator_cost? I've generally found that I need to reduce random_page_cost and seq_page_cost significantly to avoid getting sequential scans when index scans would be more reasonable, but that goes in the opposite direction as what you suggest here, in that it brings the I/O and CPU costs closer together, whereas your suggestion would push them further apart. I remember that Kevin Grittner used to say that the default value of this parameter was bad, too, but he recommended *raising* it: https://www.postgresql.org/message-id/1385148245.49487.YahooMailNeo%40web162904.mail.bf1.yahoo.com https://www.postgresql.org/message-id/4FF179780200002500048CBD@gw.wicourts.gov https://www.postgresql.org/message-id/CACjxUsNp4uEx3xsunw4wVpBDVomas7o6hnv_49bSbaz-HAVdyA%40mail.gmail.com I don't actually know what's best in terms of settings in this area. I don't have experience tuning for partitionwise join specifically. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Oct 30, 2025 at 11:52 AM Robert Haas <robertmhaas@gmail.com> wrote: > Right. Although that's the main thing here, I am inclined to suspect > there are other ways to hit this problem, maybe ways that are more > likely to happen in the real world, because... And just like that, I found another way that this can happen. Consider this query from the regression tests: SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b; Each of prt1 and prt2 have three partitions. Since t1.a = t2.a and t1.a = t2.b, the planner deduces that t2.a = t2.b. The only rows from t2 where a = b are in the first partition. The planner therefore estimates that if it just does a Merge Join between the two partitioned tables, the Merge Join will stop early. There are actually nine rows in t2 where a = b, but the planner's estimate is just three rows, so it's understandable that it will very quickly run out of rows on the t2 side of the join. Thus, in the non-partitionwise plan, while its estimated cost to scan t1 is 44.83 and its estimated cost to scan t2 is 5.55, the estimated cost of the join is only 7.99, reflecting the fact that it doesn't anticipate having to actually finish the scan of t1. Now, if it does a partitionwise join, it still picks a Merge Join for the prt1_p1/prt2_p1 sub-join, and that can still stop early. But for the prt1_p2/prt2_p2 and prt1_p3/prt2_p3 joins, it picks hash joins, which as far as the planner knows can't stop early, so there's no opportunity to get a "discount" on the cost of scanning any of those tables. As a result, the estimated cost of this plan ends up being 11.53, clearly more than the non-partitionwise estimated cost. In this case, the planner's methodology doesn't really make a lot of sense when you stop to think about it. If the planner is correct that the non-partitionwise join will stop early, then the hash joins that are chosen in the partitionwise scan will stop early, because the non-parallel hashjoin code notices when the hash table built for the inner side is empty and bails out without finishing the outer scan. But the planner code is understandably reluctant to bet on a table being completely empty for costing purposes. Performing a non-partitionwise join allows the planner to make an end run around this restriction: it can bet on the combined inner table ending up with no rows contributed by the second or third child tables without actually betting on any relation being completely empty. Consequently, it doesn't seem like this type of case can account for the original report of a massive real-world run time regression. The planner's mental gymnastics here cause it to think that a non-partitionwise plan will be faster, but as far as I can tell, there's no real reason to expect that it actually will be. -- Robert Haas EDB: http://www.enterprisedb.com
On 2025-10-30 16:52, Robert Haas wrote: > Have you localized the problem to cpu_tuple_cost specifically, vs. > cpu_index_tuple_cost or cpu_operator_cost? I've generally found that I > need to reduce random_page_cost and seq_page_cost significantly to > avoid getting sequential scans when index scans would be more > reasonable, but that goes in the opposite direction as what you > suggest here, in that it brings the I/O and CPU costs closer together, > whereas your suggestion would push them further apart. I remember that > Kevin Grittner used to say that the default value of this parameter > was bad, too, but he recommended*raising* it: In these particular cases I looked at the cpu_tuple cost is the main issue, yes. I am not saying that the cpu_tuple_cost is the real culprit in general. I tune random_page_cost and seq_page cost frequently, but as you rarely need to worry about the different cpu parts. I remember one database where I lowered it to the default again, but that was a few years ago. I am no expert on cpu_tuple_cost generally. On 2025-10-30 20:57, Robert Haas wrote: > On Thu, Oct 30, 2025 at 11:52 AM Robert Haas <robertmhaas@gmail.com> wrote: >> Right. Although that's the main thing here, I am inclined to suspect >> there are other ways to hit this problem, maybe ways that are more >> likely to happen in the real world, because... > And just like that, I found another way that this can happen. Consider > this query from the regression tests: > > SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a > AND t1.a = t2.b ORDER BY t1.a, t2.b; > > Each of prt1 and prt2 have three partitions. Since t1.a = t2.a and > t1.a = t2.b, the planner deduces that t2.a = t2.b. The only rows from > t2 where a = b are in the first partition. The planner therefore > estimates that if it just does a Merge Join between the two > partitioned tables, the Merge Join will stop early. There are actually > nine rows in t2 where a = b, but the planner's estimate is just three > rows, so it's understandable that it will very quickly run out of rows > on the t2 side of the join. Thus, in the non-partitionwise plan, while > its estimated cost to scan t1 is 44.83 and its estimated cost to scan > t2 is 5.55, the estimated cost of the join is only 7.99, reflecting > the fact that it doesn't anticipate having to actually finish the scan > of t1. > > Now, if it does a partitionwise join, it still picks a Merge Join for > the prt1_p1/prt2_p1 sub-join, and that can still stop early. But for > the prt1_p2/prt2_p2 and prt1_p3/prt2_p3 joins, it picks hash joins, > which as far as the planner knows can't stop early, so there's no > opportunity to get a "discount" on the cost of scanning any of those > tables. As a result, the estimated cost of this plan ends up being > 11.53, clearly more than the non-partitionwise estimated cost. > > In this case, the planner's methodology doesn't really make a lot of > sense when you stop to think about it. If the planner is correct that > the non-partitionwise join will stop early, then the hash joins that > are chosen in the partitionwise scan will stop early, because the > non-parallel hashjoin code notices when the hash table built for the > inner side is empty and bails out without finishing the outer scan. > But the planner code is understandably reluctant to bet on a table > being completely empty for costing purposes. 
Performing a > non-partitionwise join allows the planner to make an end run around > this restriction: it can bet on the combined inner table ending up > with no rows contributed by the second or third child tables without > actually betting on any relation being completely empty. > > Consequently, it doesn't seem like this type of case can account for > the original report of a massive real-world run time regression. The > planner's mental gymnastics here cause it to think that a > non-partitionwise plan will be faster, but as far as I can tell, > there's no real reason to expect that it actually will be. 100%. I think such cases are common. I had once a dirty hack, that created several equivalent cases (most of them not even partitioning related) of sql queries planed them and tried to execute the ones with the lowest cost. I ran it on my OLAP benchmark back then and noticed, that the average overall execution time (not the gazillion times of additional planning) suffered from this. If we give the planner a lot of options, it has more chances to do bad estimates. Quite often the estimations for provable cardinality equivalent sub paths are vastly different. It would be so immensely helpful if we could reason about that a bit better, but I have no idea how to do that. Regards Arne
On Fri, Oct 31, 2025 at 4:57 AM Robert Haas <robertmhaas@gmail.com> wrote: > And just like that, I found another way that this can happen. Consider > this query from the regression tests: > > SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a > AND t1.a = t2.b ORDER BY t1.a, t2.b; I observed something interesting about this query. If you swap the two join conditions, you should theoretically get a semantically equivalent query. SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.a = t2.a ORDER BY t1.a, t2.b; However, unlike the original query, the planner estimates that a partitionwise join is cheaper than a non-partitionwise join for this version, with costs of 12.74 and 14.24, respectively. What's more surprising is that the non-partitionwise plans for these two queries differ significantly in cost. The original query has an estimated cost of 7.86, while the modified version's cost is 14.24. This also indicates that the discrepancy is unrelated to partitionwise join. I looked into this a bit and traced it to mergejoinscansel(). This function estimates the value ranges of both inputs to determine how much of the input will actually be read, since a merge join can stop as soon as either input is exhausted. For the original query, the merge clause is "t1.a = t2.a", and the function estimates the maximum value of the right-side variable (t2.a) as 24. For the modified query, the merge clause becomes "t1.a = t2.b", and the estimated maximum value of the right-side variable (t2.b) is 597. This isn't actually incorrect given how get_variable_range() works: select max(a), max(b) from prt2; max | max -----+----- 24 | 597 (1 row) However, the logic overlooks the fact that t2.a is always constrained to be equal to t2.b, meaning their value ranges should be identical. I think we may need to do something here. However, it's a bit off-topic for this thread. I'm just noting it here in case anyone else is interested. - Richard
On Thu, Oct 30, 2025 at 3:29 AM Robert Haas <robertmhaas@gmail.com> wrote: > That said, after some investigation, I believe your conclusion to be > correct. What seems to be happening with Q3 is that the higher-cost > path (27.47, one partitionwise join) is added before the absolute > cheapest path (27.27, two partitionwise joins) and that's not enough > difference for compare_path_costs_fuzzily with STD_FUZZ_FACTOR to > prefer the second one over the first. If I raise STD_FUZZ_FACTOR from > 0.5 to 5, to make the costs of the two paths more different, then this > behavior vanishes. So I agree that this seems to have nothing to do > with your work; it appears that your test cases just got swept up in > it incidentally. > > Do you have any thoughts on how we might adjust these test cases? For Q1, the plan change does not appear to compromise its original purpose. The test case is meant to verify that unique-ification works correctly with partitionwise joins, so as long as the join to t3 is performed in a partitionwise manner, we're fine. For Q2, I suspect that the cost estimation issue in mergejoin may be affecting the planner's choice of plan. When I set enable_mergejoin to off, I was able to get a partitionwise join plan again. Therefore, I think we can modify the test case to manually disable mergejoin for this query. For Q3, you can get a plan with full partitionwise joins again by removing either the clause "t1.c = 0" or "t1.b = 0", and doing so doesn't change the query's output. You can also get a fully partitionwise join plan by removing both clauses, though in that case the query output becomes too large to include in a test case. - Richard
On Fri, Oct 31, 2025 at 8:21 AM Richard Guo <guofenglinux@gmail.com> wrote:
> > Do you have any thoughts on how we might adjust these test cases?
>
> For Q1, the plan change does not appear to compromise its original
> purpose. The test case is meant to verify that unique-ification works
> correctly with partitionwise joins, so as long as the join to t3 is
> performed in a partitionwise manner, we're fine.
>
> For Q2, I suspect that the cost estimation issue in mergejoin may be
> affecting the planner's choice of plan. When I set enable_mergejoin
> to off, I was able to get a partitionwise join plan again. Therefore,
> I think we can modify the test case to manually disable mergejoin for
> this query.
>
> For Q3, you can get a plan with full partitionwise joins again by
> removing either the clause "t1.c = 0" or "t1.b = 0", and doing so
> doesn't change the query's output. You can also get a fully
> partitionwise join plan by removing both clauses, though in that case
> the query output becomes too large to include in a test case.
Thank you, this is extremely helpful and I very much appreciate you
putting in the time to assemble these recommendations. I adjusted
these test cases as per these suggestions. I then realized that we
also need to examine the two test cases that the existing patch
already adjusted. In one of those cases, Ashutosh simply disabled
partitionwise joins, but looking up a little higher in the file, I saw
from the comments that the test case was intended to test a
partitionwise join, so that didn't seem like a good solution. Instead,
I adjusted the test case to run with merge joins turned off, as you
suggested for Q2. The resulting plan is slightly different from the
current one, but it seems like it is still testing what the original
author intended to test. But there's still one problematic case, which
is different from all the ones we've examined so far:
EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c
  FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c))
  FULL JOIN plt3_adv t3 ON (t1.c = t3.c)
 WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4
 ORDER BY t1.c, t1.a, t2.a, t3.a;
Currently, this produces a partitionwise join with each sub-join being
implemented as a hash join. With the patch, it switches to using a
non-partitionwise hash join. It appears to me that the issue is that
the cardinality estimation is less accurate when planning in
non-partitionwise fashion. If we disregard the WHERE clause for the
moment for simplicity and just look at the joins, they produce a total
of 305 rows: 55 from the first set of partitions, and 250 from the
second set of partitions. With partitionwise join, we estimate that
the join between the first pair of partitions will produce 42 rows and
the second set is perfectly estimated at 250 for a total estimate of
292. Without partitionwise join, we instead estimate 225 rows. The
planner naturally estimates that processing fewer rows will be less
expensive than processing more of them, and so picks the
non-partitionwise plan.
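One way to observe the estimation gap directly is to toggle
enable_partitionwise_join and compare the top-level row estimates for
the join-only form of the query. A sketch, again assuming the
plt1_adv/plt2_adv/plt3_adv regression tables (note that with the GUC
on, the ~292-row estimate only shows up if the partitionwise plan is
in fact the one chosen):

SET enable_partitionwise_join TO on;
EXPLAIN SELECT t1.a, t2.a, t3.a
  FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c))
  FULL JOIN plt3_adv t3 ON (t1.c = t3.c);

SET enable_partitionwise_join TO off;
EXPLAIN SELECT t1.a, t2.a, t3.a
  FROM (plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.c = t2.c))
  FULL JOIN plt3_adv t3 ON (t1.c = t3.c);

RESET enable_partitionwise_join;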
I think this is the only case we've seen so far where the proposed fix
could lead to a regression. In the first type of case, where we'd be
running a lot of tuples through an Append or Merge Append node, the
planner is correct to consider abandoning partitionwise join if doing
so will ameliorate that problem to a sufficient degree. In the second type of
case, where a non-partitionwise plan is chosen because we realize that
a Merge Join could stop early, I don't think it should make a whole
lot of difference. We'll only believe we can stop reading one side of
the input early if there's no Sort node present, and as discussed
earlier, Merge Joins without Sorts don't really benefit from being
done partitionwise; it's kind of six of one, half a dozen of the
other. Maybe there's some subtlety here that I haven't quite understood,
but it doesn't seem like this case is a big problem. However, in this
situation, the proposed fix really could mess somebody up: there is
every reason to believe that the partitionwise plan is strictly better
than the non-partitionwise plan, and we only choose the
non-partitionwise plan due to bad estimation. That seems like it could
be a common scenario, because I think we should expect that
partitionwise estimates will, on average, be more accurate than
non-partitionwise estimates.
Said differently, this is the kind of scenario where we don't really
want the non-partitionwise path to win even if it looks cheaper on
paper. I
don't really know what to do about that, though. The only way to
improve the non-partitionwise estimate would be to do partitionwise
estimation all the time, regardless of whether we think we're going to
do a partitionwise join. Maybe that's a good idea and maybe it isn't,
but even if it is, it's probably about four orders of magnitude more
code change than this patch is presently contemplating, so linking the
two of them doesn't seem like a good plan. One option is to just go
ahead and commit the fix as proposed, and hope that regressions aren't
too common. The only other idea that I have at the moment is to invent
something like enable_partitionwise_join = { off | on | always }, thus
giving a user who is harmed by this change a way to restore the
current behavior.
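Purely as an illustration of that idea (the third value is
hypothetical; only on and off exist today):

-- Hypothetical: 'always' is not a valid value in any released PostgreSQL.
-- The idea would be to keep the partitionwise plan even when a
-- non-partitionwise plan looks cheaper on paper.
SET enable_partitionwise_join TO always;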
Thoughts?
(The attached patch leaves this problematic test case failing and
makes the others pass as per the above discussion.)
--
Robert Haas
EDB: http://www.enterprisedb.com