Discussion: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)


Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

From
Richard Guo
Date:
This Assert failure can be reproduced with the query below.

create table part_tbl (a integer) partition by range (a);
create table part_tbl1 partition of part_tbl for values from (0) to (100);
set enable_partitionwise_join to on;

explain (costs off)
select * from part_tbl t1
    left join part_tbl t2 on t1.a = t2.a
    left join part_tbl t3 on t2.a = t3.a;
server closed the connection unexpectedly

This should be an oversight in 9df8f903.  It seems that the newly added
function add_outer_joins_to_relids() does not cope well with child
joins.  The 'input_relids' for a child join are the relid sets of the
child rels, while 'othersj->min_xxxhand' refers to the relids of the
parent rels.  So there would be a problem when we add the relids of the
pushed-down joins.

Instead of fixing add_outer_joins_to_relids() to cope with child joins,
I'm wondering if we can build the join relids for a child join from its
parent using adjust_child_relids(), something like the attached.

Thanks
Richard
Attachments

Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

From
Tom Lane
Date:
Richard Guo <guofenglinux@gmail.com> writes:
> This should be an oversight in 9df8f903.  It seems that the new added
> function add_outer_joins_to_relids() does not cope well with child
> joins.  The 'input_relids' for a child join is the relid sets of child
> rels while 'othersj->min_xxxhand' refers to relids of parent rels.  So
> there would be problem when we add the relids of the pushed-down joins.

Indeed.  Adding the OJ relid itself works fine, but we won't get the
required matches when we scan the join_info_list.

> Instead of fixing add_outer_joins_to_relids() to cope with child joins,
> I'm wondering if we can build join relids for a child join from its
> parent by adjust_child_relids, something like attached.

That looks like a good solid solution.  Pushed with a bit of
editorialization --- mostly, that I put the test case into
partition_join.sql where there's already suitable test tables.

            regards, tom lane



Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

From
Richard Guo
Date:

On Sat, Jul 22, 2023 at 12:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
> Instead of fixing add_outer_joins_to_relids() to cope with child joins,
> I'm wondering if we can build join relids for a child join from its
> parent by adjust_child_relids, something like attached.

That looks like a good solid solution.  Pushed with a bit of
editorialization --- mostly, that I put the test case into
partition_join.sql where there's already suitable test tables.

Thanks for pushing it!

Thanks
Richard 

Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

From
Ashutosh Bapat
Date:
Hi Tom, Richard,

On Mon, Jul 24, 2023 at 8:17 AM Richard Guo <guofenglinux@gmail.com> wrote:
>
> Thanks for pushing it!

With this fix, I saw a noticeable increase in the memory consumption
of the planner while running the experiments mentioned in [1].  The
reason is that the Bitmapsets created by bms_union() are not freed
during planning; when there are thousands of child joins involved,
there are a large number of these Bitmapsets and they take up a large
amount of memory.

The attached 0002 patch reduces the memory consumption by calculating
the appinfos only once and using them twice.  The numbers look like below:

Number of tables joined | without patch | with patch |
------------------------------------------------------
                      2 |      40.8 MiB |   40.3 MiB |
                      3 |     151.6 MiB |  146.9 MiB |
                      4 |     463.9 MiB |  445.5 MiB |
                      5 |    1663.9 MiB | 1563.3 MiB |

The memory consumption is more prominent at higher numbers of joins, as
that exponentially increases the number of child joins.

The attached setup.sql, queries.sql and patch 0001 were used to measure
memory, similar to [1].

I don't think it's a problem with the patch itself; we should be able
to use the Bitmapset APIs the way the patch does.  But there's a
problem with our Bitmapset implementation: it's not space efficient
for thousands of partitions.  A quick calculation reveals this.

If the number of partitions is 1000, the matching partitions will
usually be relids 1000, 2000, 3000 and so on.  Thus the Bitmapset
representing the relids will be {b 1000, 2000, 3000, ...}.  To represent
a single member near 1000, the current Bitmapset implementation will
allocate 1000/8 = 125 bytes of memory.  A 5-way join will require
125 * 5 = 625 bytes of memory.  This is true even for lower relid
numbers, since they will be 1000 bits apart, e.g. (1, 1001, 2001,
3001, ...).  So 1000 such child joins require 625 KiB of memory.  Doing
this as many times as the number of joins, we get 120 * 625 KiB =
75 MiB, which is close to the memory difference we see above.
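The arithmetic above can be reproduced directly (this uses the email's
loose decimal units; a real Bitmapset rounds up to whole 64-bit words and
carries a small header, so actual numbers are slightly higher):

```python
# Back-of-the-envelope memory estimate for sparse relid Bitmapsets,
# following the numbers in the text.

bytes_per_set = 1000 // 8                       # ~125 bytes to reach bit 1000
bytes_per_child_join = bytes_per_set * 5        # 5-way join: 5 such sets
bytes_for_1000_child_joins = bytes_per_child_join * 1000
total_bytes = 120 * bytes_for_1000_child_joins  # summed over all the joins

print(bytes_per_set)          # 125
print(bytes_per_child_join)   # 625
print(total_bytes)            # 75000000, i.e. ~75 MB
```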

Even if we allocated a list to hold 5 integers, it would not take 625
bytes.  I think we need to improve our Bitmapset representation to be
efficient in such cases.  Of course, whatever representation we choose
has to be space efficient for a small number of tables as well, and
should gel well with our planner logic.  So I guess some kind of
dynamic representation, which changes the underlying layout based on
the contents, is required.  I have looked up past hackers threads to
see if this has been discussed previously.
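One illustrative shape such a dynamic representation could take (a
hypothetical sketch, not a concrete proposal for PostgreSQL's Bitmapset):
store sparse sets as an array of members, and switch to a bitmap only
when the bitmap would be smaller.

```python
# Hypothetical size comparison between two set layouts; the thresholds
# and 4-byte member size are assumptions for illustration.

def set_bytes_bitmap(members):
    """Bytes for a one-bit-per-member bitmap up to the largest member."""
    return (max(members) // 8) + 1 if members else 0

def set_bytes_array(members):
    """Bytes for storing members as 4-byte integers."""
    return 4 * len(members)

def choose_repr(members):
    """Pick whichever representation is smaller for this set."""
    return ("array" if set_bytes_array(members) < set_bytes_bitmap(members)
            else "bitmap")

# Dense set of low relids: bitmap wins (2 bytes vs 40).
print(choose_repr(set(range(1, 11))))            # bitmap
# Sparse child-join set like {1, 1001, 2001, 3001, 4001}: array wins
# (20 bytes vs ~501).
print(choose_repr({1, 1001, 2001, 3001, 4001}))  # array
```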

I don't think this is the thread to discuss it, and I am also not
planning to work on it in the immediate future.  But I thought I would
mention it where I found it.

[1] https://www.postgresql.org/message-id/CAExHW5stmOUobE55pMt83r8UxvfCph+Pvo5dNpdrVCsBgXEzDQ@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat

Attachments
On Fri, Jul 28, 2023 at 3:16 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> [...]

Adding this small patch to the commitfest, with a new subject, in case
somebody finds this specific memory consumption worth fixing.



--
Best Wishes,
Ashutosh Bapat



On Fri, Sep 8, 2023 at 3:22 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> [...]

Rebased patches.
0001 is the same as the squashed version of the patches at
https://www.postgresql.org/message-id/CAExHW5sCJX7696sF-OnugAiaXS=Ag95=-m1cSrjcmyYj8Pduuw@mail.gmail.com.
0002 is the actual fix described earlier.

--
Best Wishes,
Ashutosh Bapat

Attachments
explain.out in 0001 needed some adjustments; without those, the CI bot
shows failures.  Fixed in the attached patchset.  0001 is just for
diagnosis, not for actual commit.  0002, which is the actual patch, has
no changes with respect to the previous version.

On Tue, Oct 31, 2023 at 11:06 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> [...]



--
Best Wishes,
Ashutosh Bapat

Attachments
On Mon, Nov 6, 2023 at 2:48 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> explain.out in 0001 needed some adjustments. Without those CIbot shows
> failures. Fixed in the attached patchset. 0001 is just for diagnosis,
> not for actual commit. 0002 which is the actual patch has no changes
> wrt to the previous version.
>

The first patch is no longer required.  PFA the rebased second patch.

--
Best Wishes,
Ashutosh Bapat

Attachments
Hello,

I am looking through some outstanding CommitFest entries; I wonder if
this particular patch is already effectively fixed by 5278d0a2, which
is both attributed to the original author and takes an extremely
similar approach.  Can this entry
(https://commitfest.postgresql.org/48/4553/) be closed?

Best,

David



Hi David,
Thanks for looking into this.

On Fri, May 31, 2024 at 2:19 AM David Christensen <david+pg@pgguru.net> wrote:
Hello,

I am looking through some outstanding CommitFest entries; I wonder if
this particular patch is already effectively fixed by 5278d0a2, which
is both attributed to the original author as well as an extremely
similar approach.  Can this entry
(https://commitfest.postgresql.org/48/4553/) be closed?

This is different. But it needs a rebase. PFA rebased patch. The revised commit message explains the change better.

Here are the numbers revised after 5278d0a2. Since the code changes only affect the partitionwise join code path, I am reporting only partitionwise join numbers. The first column reports the number of joining relations, each having 1000 partitions. The rest of the column names are self-explanatory.

Planning memory used:
 num_joins | master  | patched | memory saved | memory saved
-----------+---------+---------+--------------+----------------------------
         2 | 31 MB   | 30 MB   | 525 kB       |                       1.68%
         3 | 111 MB  | 107 MB  | 4588 kB      |                       4.03%
         4 | 339 MB  | 321 MB  | 17 MB        |                       5.13%
         5 | 1062 MB | 967 MB  | 95 MB        |                       8.96%

Here are the planning time measurements.
 num_joins | master (ms) | patched (ms) | change in planning time (ms) | change in planning time
-----------+-------------+--------------+------------------------------+---------------------------------------
         2 |      187.86 |       177.27 |                        10.59 |                                  5.64%
         3 |      721.81 |       758.80 |                       -36.99 |                                 -5.12%
         4 |     2239.87 |      2236.19 |                         3.68 |                                  0.16%
         5 |     6830.86 |      7027.76 |                      -196.90 |                                 -2.88%
The patch calls find_appinfos_by_relids() only once (instead of twice), calls bms_union() on the child relids only once, and allocates and deallocates the appinfos array only once, saving some CPU cycles, but it spends some CPU cycles in bms_free(). A small net saving in planning time is expected. But the numbers above show planning time increasing in some cases and decreasing in others. Repeating the experiment multiple times does not show a stable increase or decrease, and the variations are all within noise levels. We can conclude that the patch does not affect planning time adversely.

The scripts I used originally [1] need a revision too, since the EXPLAIN MEMORY output has changed. PFA the scripts. setup.sql creates the required functions and tables. queries.sql EXPLAINs queries with different numbers of joining relations, collecting planning time and memory numbers in a table (msmts). We need to change the psql variable 'code' to 'patched' or 'master' when using master + patches or the master branch respectively. The following queries are used to report the above measurements from msmts.

planning memory:
select p.num_joins,
       pg_size_pretty(m.average * 1024) "master",
       pg_size_pretty(p.average * 1024) as "patched",
       pg_size_pretty((m.average - p.average) * 1024) "memory saved",
       ((m.average - p.average) * 100/m.average)::numeric(42, 2) "memory saved in percentage"
from msmts p, msmts m
where p.num_joins = m.num_joins and p.variant = m.variant
  and p.code = 'patched' and m.code = 'master'
  and p.measurement = m.measurement and p.measurement = 'planning memory'
  and p.variant = 'pwj'
order by num_joins;

planning time:
select p.num_joins,
       m.average "master (ms)",
       p.average "patched (ms)",
       m.average - p.average "change in planning time (ms)",
       ((m.average - p.average) * 100/m.average)::numeric(42, 2) "change in planning time as percentage"
from msmts p, msmts m
where p.num_joins = m.num_joins and p.variant = m.variant
  and p.code = 'patched' and m.code = 'master'
  and p.measurement = m.measurement and p.measurement = 'planning time'
  and p.variant = 'pwj'
order by p.variant, num_joins;


--
Best Wishes,
Ashutosh Bapat
Attachments


On Wed, Jun 5, 2024 at 1:17 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
[...]

Here are some numbers with a lower number of partitions per relation.
For 100 partitions
Planning memory:
 num_joins | master  | patched | memory saved  | memory saved in percentage
-----------+---------+---------+---------------+----------------------------
         2 | 2820 kB | 2812 kB | 8192.00 bytes |                       0.28%
         3 | 9335 kB | 9270 kB | 65 kB         |                       0.70%
         4 | 27 MB   | 27 MB   | 247 kB        |                       0.90%
         5 | 78 MB   | 77 MB   | 1101 kB       |                       1.37%

Planning time:
 num_joins | master (ms) | patched (ms) | change in planning time (ms) | change in planning time as percentage
-----------+-------------+--------------+------------------------------+---------------------------------------
         2 |        3.33 |         3.21 |                         0.12 |                                  3.60%
         3 |       11.40 |        11.17 |                         0.23 |                                  2.02%
         4 |       37.61 |        37.56 |                         0.05 |                                  0.13%
         5 |      124.39 |       126.08 |                        -1.69 |                                 -1.36%

For 10 partitions
Planning memory:
 num_joins | master  | patched | memory saved  | memory saved in percentage
-----------+---------+---------+---------------+----------------------------
         2 | 310 kB  | 310 kB  | 0.00 bytes    |                       0.00
         3 | 992 kB  | 989 kB  | 3072.00 bytes |                       0.30
         4 | 2891 kB | 2883 kB | 8192.00 bytes |                       0.28
         5 | 8317 kB | 8290 kB | 27 kB         |                       0.32

Planning time:
 num_joins | master (ms) | patched (ms) | change in planning time (ms) | change in planning time as percentage
-----------+-------------+--------------+------------------------------+---------------------------------------
         2 |        0.42 |         0.37 |                         0.05 |                                 11.90
         3 |        1.08 |         1.09 |                        -0.01 |                                 -0.93
         4 |        3.56 |         3.32 |                         0.24 |                                  6.74
         5 |        8.86 |         8.78 |                         0.08 |                                  0.90

Observations:
1. As expected, the memory savings are smaller for a small number of partitions, but they are not negative. Thus the patch helps in the case of a large number of partitions without adversely affecting a small number of partitions.
2. Planning time changes are within noise. But usually I see that the planning time varies a lot, so the numbers here just indicate that the planning times are not adversely affected by the change. Theoretically, I would expect the patch to improve planning time, as explained in the earlier email.
3. As mentioned in the earlier email, the patch affects only the partitionwise join code path, hence I am reporting numbers with partitionwise join enabled.

--
Best Wishes,
Ashutosh Bapat
On Wed, Jun 5, 2024 at 3:48 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> Here's planning time measurements.
>  num_joins | master (ms) | patched (ms) | change in planning time (ms) | change in planning time
> -----------+-------------+--------------+------------------------------+---------------------------------------
>          2 |      187.86 |       177.27 |                        10.59 |                                  5.64%
>          3 |      721.81 |       758.80 |                       -36.99 |                                 -5.12%
>          4 |     2239.87 |      2236.19 |                         3.68 |                                  0.16%
>          5 |     6830.86 |      7027.76 |                      -196.90 |                                 -2.88%

I find these results concerning. Why should the planning time
sometimes go up? And why should it go up for odd numbers of joinrels
and down for even numbers of joinrels? I wonder if these results are
really correct, and if they are, I wonder what could account for such
odd behavior.

--
Robert Haas
EDB: http://www.enterprisedb.com





On Thu, Jun 6, 2024 at 10:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 5, 2024 at 3:48 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> Here's planning time measurements.
>  num_joins | master (ms) | patched (ms) | change in planning time (ms) | change in planning time
> -----------+-------------+--------------+------------------------------+---------------------------------------
>          2 |      187.86 |       177.27 |                        10.59 |                                  5.64%
>          3 |      721.81 |       758.80 |                       -36.99 |                                 -5.12%
>          4 |     2239.87 |      2236.19 |                         3.68 |                                  0.16%
>          5 |     6830.86 |      7027.76 |                      -196.90 |                                 -2.88%

I find these results concerning. Why should the planning time
sometimes go up? And why should it go up for odd numbers of joinrels
and down for even numbers of joinrels? I wonder if these results are
really correct, and if they are, I wonder what could account for such
odd behavior 

This is just one instance of measurements; if I run the experiment multiple times, the results and the patterns will vary. Usually I have found planning time to vary within 5% for regular tables and within 9% for partitioned tables with a large number of partitions. Below are measurements with the experiment repeated multiple times. For a given number of partitioned tables (each with 1000 partitions) being joined, planning time is measured 10 consecutive times. For this set of 10 runs we calculate the average and standard deviation of the planning time. Such 10 sets are sampled, which means planning time is sampled 100 times in total, with and without the patch respectively. Measurements with master and patched are reported in the attached Excel sheet.

Observations from the spreadsheet:
1. The differences in planning time between the master and patched versions are within the standard deviation, in both directions, indicating that the patch doesn't affect planning time adversely.
2. If we just look at the 5-way join numbers, there are more positive values in column E, sometimes higher than the standard deviation, indicating that the patch improves planning time when a higher number of partitioned tables are being joined.
3. If we just look at the 5-way join numbers, the patch seems to reduce the standard deviation (column B vs column D), indicating that the planning time is more stable and predictable with the patched version.
4. There is no bias towards even or odd numbers of partitioned tables being joined.

Looking at the memory savings numbers reported earlier, the patch improves the memory consumption of partitionwise join without adversely affecting planning time. It doesn't affect code paths which do not use partitionwise join. That's an overall net positive impact for relatively little code churn.

--
Best Wishes,
Ashutosh Bapat
Attachments
On Mon, Jun 10, 2024 at 3:09 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> This is just one instance of measurements. If I run the experiment multiple
> times the results and the patterns will vary. Usually I have found planning
> time to vary within 5% for regular tables and within 9% for partitioned
> tables with a large number of partitions. Below are measurements with the
> experiment repeated multiple times. For a given number of partitioned tables
> (each with 1000 partitions) being joined, planning time is measured 10
> consecutive times. For this set of 10 runs we calculate average and standard
> deviation of planning time. Such 10 sets are sampled. This means planning
> time is sampled 100 times in total with and without patch respectively.
> Measurements with master and patched are reported in the attached excel sheet.

Well, this is fine then I guess, but if the original results weren't
stable enough for people to draw conclusions from, then it's better
not to post them, and instead do this work to get results that are
stable before posting.

--
Robert Haas
EDB: http://www.enterprisedb.com