Discussion: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943

The following bug has been logged on the website:

Bug reference:      18377
Logged by:          yajun Hu
Email address:      1026592243@qq.com
PostgreSQL version: 14.11
Operating system:   CentOS7 with kernel version 5.10
Description:

I have reproduced this problem in REL_14_11 and the latest master branch
(393b5599e5177e456cdce500039813629d370b38).
The steps to reproduce are as follows.
1. ./configure  --enable-debug --enable-depend --enable-cassert CFLAGS=-O0
2. make -j; make install -j; initdb -D ./primary; pg_ctl -D ./primary -l
logfile start
3. alter system set plan_cache_mode to 'force_generic_plan' ; select
pg_reload_conf();
4. create table p( a int,b int) partition by range(a);create table p1
partition of p for values from (0) to (1);create table p2 partition of p for
values from (1) to (2);
5. use pgbench with the following SQL
1.sql
SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
\if :gotlock
        alter table p detach partition p1 concurrently ;
        alter table p attach partition p1 for values from (0) to (1);
\endif
2.sql
\set id random(0,1)
select * from p where a = :id
pgbench --no-vacuum --client=5 --transactions=1000000 -f 1.sql -f 2.sql -h
127.0.0.1 -M prepared -p 5432

In my testing, a core dump appears within 10 seconds, and the stack is as
follows:
(gdb) bt
#0  0x00007effe61aa277 in raise () from /lib64/libc.so.6
#1  0x00007effe61ab968 in abort () from /lib64/libc.so.6
#2  0x0000000000b8748d in ExceptionalCondition (conditionName=0xd25358
"partdesc->nparts >= pinfo->nparts", fileName=0xd24cfc "execPartition.c",
lineNumber=1943) at assert.c:66
#3  0x0000000000748bf1 in CreatePartitionPruneState (planstate=0x1898ad0,
pruneinfo=0x1884188) at execPartition.c:1943
#4  0x00000000007488cb in ExecInitPartitionPruning (planstate=0x1898ad0,
n_total_subplans=2, pruneinfo=0x1884188,
initially_valid_subplans=0x7ffdca29f7d0) at execPartition.c:1803
#5  0x000000000076171d in ExecInitAppend (node=0x17cb5e0, estate=0x1898870,
eflags=32) at nodeAppend.c:146
#6  0x00000000007499af in ExecInitNode (node=0x17cb5e0, estate=0x1898870,
eflags=32) at execProcnode.c:182
#7  0x000000000073f514 in InitPlan (queryDesc=0x1880f58, eflags=32) at
execMain.c:969
#8  0x000000000073e3ea in standard_ExecutorStart (queryDesc=0x1880f58,
eflags=32) at execMain.c:267
#9  0x000000000073e15f in ExecutorStart (queryDesc=0x1880f58, eflags=0) at
execMain.c:146
#10 0x00000000009cab67 in PortalStart (portal=0x181b000, params=0x1880ec8,
eflags=0, snapshot=0x0) at pquery.c:517
#11 0x00000000009c5e49 in exec_bind_message (input_message=0x7ffdca29fd20)
at postgres.c:2028
#12 0x00000000009c940e in PostgresMain (dbname=0x17d8e88 "postgres",
username=0x17d8e68 "postgres") at postgres.c:4723
#13 0x00000000008f8024 in BackendRun (port=0x17cd640) at postmaster.c:4477
#14 0x00000000008f77c1 in BackendStartup (port=0x17cd640) at
postmaster.c:4153
#15 0x00000000008f4256 in ServerLoop () at postmaster.c:1771
#16 0x00000000008f3c40 in PostmasterMain (argc=3, argv=0x179b680) at
postmaster.c:1470
#17 0x00000000007be309 in main (argc=3, argv=0x179b680) at main.c:198
(gdb) f 3
#3  0x0000000000748bf1 in CreatePartitionPruneState (planstate=0x1898ad0,
pruneinfo=0x1884188) at execPartition.c:1943
1943                            Assert(partdesc->nparts >= pinfo->nparts);
(gdb) p partdesc->nparts
$1 = 1
(gdb) p pinfo->nparts
$2 = 2

I tried to analyze this problem and found the following:
1. In a release (non-assert) build, the ERROR "could not match partition child
tables to plan elements" is thrown instead.
2. When the Assert fails, the plan contains 2 partitions, but the concurrent
detach has already committed, so only 1 partition is visible at execution time,
which violates the design assumption (see the sketch after this list).
3. Perhaps detach concurrently should have a more appropriate waiting
mechanism, or planning should exclude such partitions in this scenario.
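
For illustration, here is a minimal self-contained C model (not PostgreSQL
source; the variable names are only stand-ins) of the invariant behind the
failing Assert: the executor is expected to see at least as many partitions as
the planner did, and a DETACH CONCURRENTLY committing between planning and
executor startup breaks exactly that expectation.

#include <assert.h>
#include <stdio.h>

int
main(void)
{
    /* pinfo->nparts: partitions the planner saw (p1 and p2) */
    int     plan_nparts = 2;

    /*
     * partdesc->nparts: partitions the executor sees at ExecInitAppend time.
     * The concurrent detach of p1 has committed by now, so only p2 is left.
     */
    int     exec_nparts = 1;

    /* mirrors Assert(partdesc->nparts >= pinfo->nparts) in execPartition.c */
    assert(exec_nparts >= plan_nparts);

    printf("not reached when the race hits\n");
    return 0;
}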


On Tue, 5 Mar 2024 at 20:17, PG Bug reporting form
<noreply@postgresql.org> wrote:
> 1943                            Assert(partdesc->nparts >= pinfo->nparts);
> (gdb) p partdesc->nparts
> $1 = 1
> (gdb) p pinfo->nparts
> $2 = 2

The following comment says this shouldn't happen, but apparently it can:

* Because we request detached partitions to be included, and
* detaching waits for old transactions, it is safe to assume that
* no partitions have disappeared since this query was planned.

Added Álvaro as he knows this area best.

David



On 2024-Mar-06, David Rowley wrote:

> On Tue, 5 Mar 2024 at 20:17, PG Bug reporting form
> <noreply@postgresql.org> wrote:
> > 1943                            Assert(partdesc->nparts >= pinfo->nparts);
> > (gdb) p partdesc->nparts
> > $1 = 1
> > (gdb) p pinfo->nparts
> > $2 = 2
> 
> The following comment says this shouldn't happen, but apparently it can:
> 
> * Because we request detached partitions to be included, and
> * detaching waits for old transactions, it is safe to assume that
> * no partitions have disappeared since this query was planned.
> 
> Added Álvaro as he knows this area best.

Oh, ugh, apologies, I hadn't noticed this report.  I'll have a look ...

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



On 2024-Mar-05, PG Bug reporting form wrote:

> #2  0x0000000000b8748d in ExceptionalCondition (conditionName=0xd25358
> "partdesc->nparts >= pinfo->nparts", fileName=0xd24cfc "execPartition.c",
> lineNumber=1943) at assert.c:66
> #3  0x0000000000748bf1 in CreatePartitionPruneState (planstate=0x1898ad0,
> pruneinfo=0x1884188) at execPartition.c:1943
> #4  0x00000000007488cb in ExecInitPartitionPruning (planstate=0x1898ad0,
> n_total_subplans=2, pruneinfo=0x1884188,
> initially_valid_subplans=0x7ffdca29f7d0) at execPartition.c:1803

I had been digging into this crash in late March and seeing if I could
find a reliable fix, but it seems devilish and I had to put it aside.  The
problem is that DETACH CONCURRENTLY does a wait for snapshots to
disappear before doing the next detach phase; but since pgbench is using
prepared mode, the wait is already long done by the time EXECUTE wants
to run the plan.  Now, we have relcache invalidations at the point where
the wait ends, and those relcache invalidations should in turn cause the
prepared plan to be invalidated, so we would get a new plan that
excludes the partition being detached.  But this doesn't happen for some
reason that I haven't yet been able to understand.

Still trying to find a proper fix.  In the meantime, not using prepared
plans should serve to work around the problem.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net





Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Tue, Apr 9, 2024 at 01:57:
On 2024-Mar-05, PG Bug reporting form wrote:

> #2  0x0000000000b8748d in ExceptionalCondition (conditionName=0xd25358
> "partdesc->nparts >= pinfo->nparts", fileName=0xd24cfc "execPartition.c",
> lineNumber=1943) at assert.c:66
> #3  0x0000000000748bf1 in CreatePartitionPruneState (planstate=0x1898ad0,
> pruneinfo=0x1884188) at execPartition.c:1943
> #4  0x00000000007488cb in ExecInitPartitionPruning (planstate=0x1898ad0,
> n_total_subplans=2, pruneinfo=0x1884188,
> initially_valid_subplans=0x7ffdca29f7d0) at execPartition.c:1803

I had been digging into this crash in late March and seeing if I could
find a reliable fix, but it seems devilish and I had to put it aside.  The
problem is that DETACH CONCURRENTLY does a wait for snapshots to
disappear before doing the next detach phase; but since pgbench is using
prepared mode, the wait is already long done by the time EXECUTE wants
to run the plan.  Now, we have relcache invalidations at the point where
the wait ends, and those relcache invalidations should in turn cause the
prepared plan to be invalidated, so we would get a new plan that
excludes the partition being detached.  But this doesn't happen for some
reason that I haven't yet been able to understand.

Still trying to find a proper fix.  In the meantime, not using prepared
plans should serve to work around the problem.

--
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net



I have been analyzing this crash for the last few days, and I added a lot of debug output to the code.
Finally, I found a code execution sequence that triggers this assert, and I can
reproduce the crash with gdb instead of pgbench.

For example:
./psql postgres  # session1, which will do the detach; start it first
In another terminal, start gdb (call it gdb1):
   gdb -p session1_pid
   b ATExecDetachPartition

In session1, run: alter table p detach partition p1 concurrently;
session1 is now stopped by gdb.

In gdb1, single-step (e.g. "n") until the first transaction calls CommitTransactionCommand(),
and stop at CommitTransactionCommand().

Start another session, session2, to do the select, and run:
prepare p1 as select * from p where a = $1;

Start a new terminal with gdb (call it gdb2):
    gdb -p session2_pid
    b exec_simple_query
In session2, run: execute p1(1);
session2 is now stopped by gdb.

In gdb2, step into PortalRunUtility() and stop right after a snapshot has been taken.
From session2's point of view, the transaction updating pg_inherits has not yet committed.
Switch to gdb1 and keep stepping until DetachPartitionFinalize() is about to be called.
Because session2 has not yet acquired the lock on relation p, gdb1 can get past WaitForLockersMultiple().

Now switch to gdb2 and continue.  With a breakpoint on find_inheritance_children_extended(),
we see a tuple whose inhdetachpending is true but whose xmin is still in progress according to
session2's snapshot, so the tuple is added to the output per the existing logic.  The planner
therefore ends up with two partitions.
After returning from add_base_rels_to_query() in query_planner(), switch to gdb1.

In gdb1, enter DetachPartitionFinalize(), which calls RemoveInheritance() to remove the tuple.
Then issue "continue" to let the rest of the detach finish.

Now switch to gdb2 and set a breakpoint at RelationCacheInvalidateEntry().  Continue gdb2; it
stops at RelationCacheInvalidateEntry(), and we can see that the relcache entry for p is cleared.
The backtrace is attached at the end of this email.

On entering ExecInitAppend(), part_prune_info is not null, so we enter CreatePartitionPruneState().
There we call find_inheritance_children_extended() again to build the partdesc, but by now gdb1
has finished DetachPartitionFinalize() and the detach has committed, so we get only one tuple
and nparts is 1.

Finally, we trigger the Assert: (partdesc->nparts >= pinfo->nparts).


--
Tender Wang
OpenPie:  https://en.openpie.com/


Tender Wang <tndrwang@gmail.com> wrote on Thu, Apr 18, 2024 at 20:13:


Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Tue, Apr 9, 2024 at 01:57:
On 2024-Mar-05, PG Bug reporting form wrote:

> #2  0x0000000000b8748d in ExceptionalCondition (conditionName=0xd25358
> "partdesc->nparts >= pinfo->nparts", fileName=0xd24cfc "execPartition.c",
> lineNumber=1943) at assert.c:66
> #3  0x0000000000748bf1 in CreatePartitionPruneState (planstate=0x1898ad0,
> pruneinfo=0x1884188) at execPartition.c:1943
> #4  0x00000000007488cb in ExecInitPartitionPruning (planstate=0x1898ad0,
> n_total_subplans=2, pruneinfo=0x1884188,
> initially_valid_subplans=0x7ffdca29f7d0) at execPartition.c:1803

I had been digging into this crash in late March and seeing if I could
find a reliable fix, but it seems devilish and I had to put it aside.  The
problem is that DETACH CONCURRENTLY does a wait for snapshots to
disappear before doing the next detach phase; but since pgbench is using
prepared mode, the wait is already long done by the time EXECUTE wants
to run the plan.  Now, we have relcache invalidations at the point where
the wait ends, and those relcache invalidations should in turn cause the
prepared plan to be invalidated, so we would get a new plan that
excludes the partition being detached.  But this doesn't happen for some
reason that I haven't yet been able to understand.

Still trying to find a proper fix.  In the meantime, not using prepared
plans should serve to work around the problem.

--
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net



I have been analyzing this crash for the last few days, and I added a lot of debug output to the code.
Finally, I found a code execution sequence that triggers this assert, and I can
reproduce the crash with gdb instead of pgbench.

For example:
./psql postgres  # session1, which will do the detach; start it first
In another terminal, start gdb (call it gdb1):
   gdb -p session1_pid
   b ATExecDetachPartition

In session1, run: alter table p detach partition p1 concurrently;
session1 is now stopped by gdb.

In gdb1, single-step (e.g. "n") until the first transaction calls CommitTransactionCommand(),
and stop at CommitTransactionCommand().

Start another session, session2, to do the select, and run:
prepare p1 as select * from p where a = $1;

Start a new terminal with gdb (call it gdb2):
    gdb -p session2_pid
    b exec_simple_query
In session2, run: execute p1(1);
session2 is now stopped by gdb.

In gdb2, step into PortalRunUtility() and stop right after a snapshot has been taken.
From session2's point of view, the transaction updating pg_inherits has not yet committed.
Switch to gdb1 and keep stepping until DetachPartitionFinalize() is about to be called.
Because session2 has not yet acquired the lock on relation p, gdb1 can get past WaitForLockersMultiple().

Now switch to gdb2 and continue.  With a breakpoint on find_inheritance_children_extended(),
we see a tuple whose inhdetachpending is true but whose xmin is still in progress according to
session2's snapshot, so the tuple is added to the output per the existing logic.  The planner
therefore ends up with two partitions.
After returning from add_base_rels_to_query() in query_planner(), switch to gdb1.

In gdb1, enter DetachPartitionFinalize(), which calls RemoveInheritance() to remove the tuple.
Then issue "continue" to let the rest of the detach finish.

Now switch to gdb2 and set a breakpoint at RelationCacheInvalidateEntry().  Continue gdb2; it
stops at RelationCacheInvalidateEntry(), and we can see that the relcache entry for p is cleared.
The backtrace is attached at the end of this email.

On entering ExecInitAppend(), part_prune_info is not null, so we enter CreatePartitionPruneState().
There we call find_inheritance_children_extended() again to build the partdesc, but by now gdb1
has finished DetachPartitionFinalize() and the detach has committed, so we get only one tuple
and nparts is 1.

Finally, we trigger the Assert: (partdesc->nparts >= pinfo->nparts).


--
Tender Wang
OpenPie:  https://en.openpie.com/

Sorry, I forgot to include the backtrace of the RelationCacheInvalidateEntry() call during the planning phase in my last email.

I found a self-contradictory comment in CreatePartitionPruneState():

    /* For data reading, executor always omits detached partitions */
    if (estate->es_partition_directory == NULL)
        estate->es_partition_directory =
            CreatePartitionDirectory(estate->es_query_cxt, false);

Shouldn't it say "does not omit", if I haven't misunderstood?  After all, we pass
false to the function.


I think we could rewrite the logic of CreatePartitionPruneState() along these lines:

if (partdesc->nparts == pinfo->nparts)
{
    /* no partitions attached and none detached */
}
else if (partdesc->nparts > pinfo->nparts)
{
    /* new partitions were attached */
}
else
{
    /* a partition was detached */
}

I haven't yet figured out a fix for the scenario I described in my last email.
--
Tender Wang
OpenPie:  https://en.openpie.com/
On 2024-Apr-18, Tender Wang wrote:

> Now switch to gdb2 and set a breakpoint at RelationCacheInvalidateEntry().
> Continue gdb2; it stops at RelationCacheInvalidateEntry(), and we can see
> that the relcache entry for p is cleared.
> The backtrace is attached at the end of this email.

Here is where I think the problem occurs -- I mean, surely
PlanCacheRelCallback marking the plan as ->is_valid=false should cause
the prepared query to be replanned, and at that point the replan would
see that the partition is no more.  So by the time we get to this:

> On entering ExecInitAppend(), part_prune_info is not null, so we enter
> CreatePartitionPruneState().  There we call find_inheritance_children_extended()
> again to build the partdesc, but by now gdb1 has finished
> DetachPartitionFinalize() and the detach has committed, so we get only one
> tuple and nparts is 1.

we have made a new plan, one whose planner partition descriptor has only
one partition, so it'd match what ExecInitAppend sees.

Evidently there's something I'm missing in how this plancache
invalidation works.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/





Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Tue, May 14, 2024 at 00:55:
On 2024-Apr-18, Tender Wang wrote:

> Now switch to gdb2 and set a breakpoint at RelationCacheInvalidateEntry().
> Continue gdb2; it stops at RelationCacheInvalidateEntry(), and we can see
> that the relcache entry for p is cleared.
> The backtrace is attached at the end of this email.

It happened in deconstruct_jointree() (called from query_planner()), after the planner had
already obtained the partition information (nparts is 2).  After RelationCacheInvalidateEntry(),
we re-fetch the partition information during executor initialization (nparts is 1).

Here is where I think the problem occurs -- I mean, surely
PlanCacheRelCallback marking the plan as ->is_valid=false should cause
the prepared query to be replanned, and at that point the replan would
see that the partition is no more.  So by the time we get to this:

 

> On entering ExecInitAppend(), part_prune_info is not null, so we enter
> CreatePartitionPruneState().  There we call find_inheritance_children_extended()
> again to build the partdesc, but by now gdb1 has finished
> DetachPartitionFinalize() and the detach has committed, so we get only one
> tuple and nparts is 1.

we have made a new plan, one whose planner partition descriptor has only
one partition, so it'd match what ExecInitAppend sees.

Evidently there's something I'm missing in how this plancache
invalidation works.

Hmm, I don't think this issue is closely related to plancache invalidation.

In the scenario I created in [1] there is no cached plan, so we create a new plan.

After session1 (doing the detach) has updated pg_inherits, and before that first
transaction commits, session2 (doing the execute) takes its snapshot.
session1 calls WaitForLockersMultiple() before session2 has acquired the lock on
the parent relation.

When session2 later fetches the partition information during planning, the
pg_inherits tuple that session1 updated is still visible to session2, because
find_inheritance_children_extended() contains this code:

------
xmin = HeapTupleHeaderGetXmin(inheritsTuple->t_data);
snap = GetActiveSnapshot();

if (!XidInMVCCSnapshot(xmin, snap))
------

So the planner sees 2 partitions.

session1 then continues the detach; because it already got past WaitForLockersMultiple(),
it calls RemoveInheritance() to delete the tuple and queues an invalidation message
for the parent relation.  session2 then enters RelationCacheInvalidateEntry() from
deconstruct_jointree(), so its cached partition information is cleared.

When ExecInitAppend() is called, we fetch the partition information again.  session1
has finished its work by now, so session2 gets only 1 pg_inherits tuple.

Finally, we hit the assert.


The key points:
1. The updated pg_inherits tuple must still be visible to the SELECT.
2. The detaching session can get past WaitForLockersMultiple().

We use two transactions to do a concurrent detach partition.
My question is: can two transactions really guarantee consistency here?
--
Tender Wang
OpenPie:  https://en.openpie.com/
On 2024-May-13, Alvaro Herrera wrote:

> Evidently there's something I'm missing in how this plancache
> invalidation works.

Actually, what I was missing is that I had forgotten how
PartitionDirectory is actually used.  In the original code I wrote for
DETACH CONCURRENTLY for 2ndQPostgres a few years ago, I set things up so
that both planner and executor used the same PartitionDesc for a
partitioned table.  But when this was reworked for real Postgres by
Robert to implement concurrent ATTACH, he instead introduced
PartitionDirectory which keeps a PartitionDesc unchanged -- but planner
and executor use different PartitionDirectories.  I probably reworked
some of DETACH CONCURRENTLY to cope with that, but evidently must have
missed this scenario.

So the problem is that we read the partition descriptor during
re-planning with N partitions, and then if we receive the corresponding
invalidation message (at executor time) exactly before
CreatePartitionPruneState() creates its PartitionDirectory, then we read
N-1 partitions, and everything explodes.  (If the invalidation is
received at some other time, then the plan is remade before
ExecInitAppend by plancache, and everything works correctly.)

So one fix for this problem seems to be to patch
CreatePartitionPruneState() to accept the possibility that one partition
has been detached concurrently, and construct the planmap/partmap arrays
with a marking that the partition in question has been pruned away.
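
To make that concrete, here is a small self-contained sketch (plain C with
hypothetical names, not the actual patch; the real code works on
partdesc->oids and pinfo->relid_map, both in partition-bound order) of how the
executor-side map could be built under the assumption that only concurrent
detaches happened, i.e. the executor's OID array is a subsequence of the
planner's:

#include <stdio.h>

typedef unsigned int Oid;

/*
 * exec_oids[0..n_exec)  : partitions the executor sees, in bound order
 * plan_oids[0..n_plan)  : partitions the planner saw, in bound order
 * plan_subplan[i]       : subplan index the plan assigned to plan partition i
 * subplan_map[j]        : output; subplan index for executor partition j
 *
 * Plan partitions that get skipped over never land in subplan_map, so their
 * subplans are simply treated as having been pruned away.
 */
static void
build_subplan_map(const Oid *exec_oids, int n_exec,
                  const Oid *plan_oids, int n_plan,
                  const int *plan_subplan, int *subplan_map)
{
    int     pd = 0;             /* cursor into the planner's arrays */

    for (int j = 0; j < n_exec; j++)
    {
        /* skip plan entries whose partition was detached concurrently */
        while (pd < n_plan && plan_oids[pd] != exec_oids[j])
            pd++;
        subplan_map[j] = (pd < n_plan) ? plan_subplan[pd++] : -1;
    }
}

int
main(void)
{
    /* the plan saw p1 (oid 101) and p2 (oid 102); p1 was detached */
    Oid     plan_oids[] = {101, 102};
    int     plan_subplan[] = {0, 1};
    Oid     exec_oids[] = {102};
    int     subplan_map[1];

    build_subplan_map(exec_oids, 1, plan_oids, 2, plan_subplan, subplan_map);
    printf("executor partition 102 -> subplan %d\n", subplan_map[0]);  /* 1 */
    return 0;
}

(A later message in this thread shows that attaches and detaches can mix, so
the subsequence assumption above does not always hold; see the skip-ahead
matching sketched further down.)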


With that fix in place, I was now getting errors from
RelationBuildPartitionDesc() that some partition in the partdesc had
NULL relpartbound.  If I understand correctly, the problem here is that
we have a snapshot that still sees that partition as being in the
hierarchy, but in reality its relpartbound has been nullified by the
concurrent detach.  I can fix this symptom by doing
AcceptInvalidationMessages() and starting back at the top, where the list
of partitions is obtained by find_inheritance_children_extended().
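
Roughly, the retry shape is the following (a self-contained sketch with
simulated reads and hypothetical helper names; the real code lives in
RelationBuildPartitionDesc() and calls AcceptInvalidationMessages() before
rereading the pg_inherits children):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static int reads_done = 0;

/* Stand-in for reading the children and their bounds; returns false if some
 * child's relpartbound turned out to be NULL (concurrent detach in progress). */
static bool
read_children_and_bounds(void)
{
    reads_done++;
    return reads_done > 1;      /* simulate: first read races, second is clean */
}

/* Stand-in for AcceptInvalidationMessages(): flush stale cached catalog state. */
static void
refresh_caches(void)
{
}

int
main(void)
{
    bool    retried = false;

retry:
    if (!read_children_and_bounds())
    {
        if (retried)
        {
            fprintf(stderr, "partition bound still missing after retry\n");
            return EXIT_FAILURE;
        }
        refresh_caches();
        retried = true;
        goto retry;             /* one retry suffices: detach waits for our snapshot */
    }
    printf("partition descriptor built after %d read(s)\n", reads_done);
    return 0;
}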

With these two things in place, the pgbench scripts run without errors.
That's in patch 0001, which deserves a commit message that distills the
above.


I'm not convinced that this fix is correct or complete, but hey, it's
progress.  Here are some things I'm worried about:

1. this code assumes that a single partition can be in this "detach
concurrently" mode.  This sounds correct, because this operation needs
share update exclusive lock on the table, and we only allow one
partition to be in pending-detach state.  However, what if we have a
very slow process on the query side, and two processes manage to
detach two different tables in between?  I think this isn't possible
because of the wait-for-snapshots phase, but I need to convince myself
that this is the right explanation.

2. The new code in CreatePartitionPruneState assumes that if nparts
decreases, then it must be a detach, and if nparts increases, it must be
an attach.  Can these two things happen together in a way that we see
that the number of partitions remains the same, so we don't actually try
to construct planmap/partmap arrays by matching their OIDs?  I think the
only way to handle a possible problem here would be to verify the OIDs
every time we construct a partition descriptor.  I assume (without
checking) this would come with a performance cost, not sure.

3. The new stanza in CreatePartitionPruneState() should have more
sanity-checks that the two arrays we scan really match.

4. I made RelationBuildPartitionDesc retry only once.  I think this
should be enough, because a single AcceptInvalidationMessages should
process everything that we need to read to bring us up to current
reality, considering that DETACH CONCURRENTLY would need to wait for our
snapshot.

I would like to spend some more time on this, but I'm afraid I'll have
to put it off for a little while.

PS -- yes, the self-contradictory comment in CreatePartitionPruneState()
is bogus.  Patch 0002 fixes that.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."                              (Brian Kernighan)

Attachments


Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Tue, May 21, 2024 at 00:16:
On 2024-May-13, Alvaro Herrera wrote:

> Evidently there's something I'm missing in how this plancache
> invalidation works.

Actually, what I was missing is that I had forgotten how
PartitionDirectory is actually used.  In the original code I wrote for
DETACH CONCURRENTLY for 2ndQPostgres a few years ago, I set things up so
that both planner and executor used the same PartitionDesc for a
partitioned table.  But when this was reworked for real Postgres by
Robert to implement concurrent ATTACH, he instead introduced
PartitionDirectory which keeps a PartitionDesc unchanged -- but planner
and executor use different PartitionDirectories.  I probably reworked
some of DETACH CONCURRENTLY to cope with that, but evidently must have
missed this scenario.

So the problem is that we read the partition descriptor during
re-planning with N partitions, and then if we receive the corresponding
invalidation message (at executor time) exactly before
CreatePartitionPruneState() creates its PartitionDirectory, then we read
N-1 partitions, and everything explodes.  (If the invalidation is
received at some other time, then the plan is remade before
ExecInitAppend by plancache, and everything works correctly.)

So one fix for this problem seems to be to patch
CreatePartitionPruneState() to accept the possibility that one partition
has been detached concurrently, and construct the planmap/partmap arrays
with a marking that the partition in question has been pruned away.


With that fix in place, I was now getting errors from
RelationBuildPartitionDesc() that some partition in the partdesc had
NULL relpartbound.  If I understand correctly, the problem here is that
we have a snapshot that still sees that partition as being in the
hierarchy, but in reality its relpartbound has been nullified by the
concurrent detach.  I can fix this symptom by doing
AcceptInvalidationMessages() and starting back at the top, where the list
of partitions is obtained by find_inheritance_children_extended().

With these two things in place, the pgbench scripts run without errors.
That's in patch 0001, which deserves a commit message that distills the
above.


I'm not convinced that this fix is correct or complete, but hey, it's
progress.  Here are some things I'm worried about:

1. this code assumes that a single partition can be in this "detach
concurrently" mode.  This sounds correct, because this operation needs
share update exclusive lock on the table, and we only allow one
partition to be in pending-detach state.  However, what if we have a
very slow process on the query side, and two processes manage to
detach two different tables in between?  I think this isn't possible
because of the wait-for-snapshots phase, but I need to convince myself
that this is the right explanation.

2. The new code in CreatePartitionPruneState assumes that if nparts
decreases, then it must be a detach, and if nparts increases, it must be
an attach.  Can these two things happen together in a way that we see
that the number of partitions remains the same, so we don't actually try
to construct planmap/partmap arrays by matching their OIDs?  I think the
only way to handle a possible problem here would be to verify the OIDs
every time we construct a partition descriptor.  I assume (without
checking) this would come with a performance cost, not sure.

3. The new stanza in CreatePartitionPruneState() should have more
sanity-checks that the two arrays we scan really match.

4. I made RelationBuildPartitionDesc retry only once.  I think this
should be enough, because a single AcceptInvalidationMessages should
process everything that we need to read to bring us up to current
reality, considering that DETACH CONCURRENTLY would need to wait for our
snapshot.

I would like to spend some more time on this, but I'm afraid I'll have
to put it off for a little while.

PS -- yes, the self-contradictory comment in CreatePartitionPruneState()
is bogus.  Patch 0002 fixes that.

I have tested this patch locally and did not encounter any failures.
I will take the time to look at the patch in detail and consider the issues you mentioned.
--
Tender Wang
OpenPie:  https://en.openpie.com/
On 2024-May-22, Tender Wang wrote:

> I have tested this patch locally and did not encounter any failures.
> I will take the time to look at the patch in detail and consider the issues
> you mentioned.

Thank you.  In the meantime,

> Alvaro Herrera <alvherre@alvh.no-ip.org> 于2024年5月21日周二 00:16写道:

> > 2. The new code in CreatePartitionPruneState assumes that if nparts
> > decreases, then it must be a detach, and if nparts increases, it must be
> > an attach.  Can these two things happen together in a way that we see
> > that the number of partitions remains the same, so we don't actually try
> > to construct planmap/partmap arrays by matching their OIDs?  I think the
> > only way to handle a possible problem here would be to verify the OIDs
> > every time we construct a partition descriptor.  I assume (without
> > checking) this would come with a performance cost, not sure.

I modified the scripts slightly so that two partitions would be
detached, and lo and behold -- the case where we have one new partition
appearing and one partition disappearing concurrently can indeed happen.
So we have that both nparts are identical, but the OID arrays don't
match.  I attach the scripts I used to test.

I think in order to fix this we would have to compare the OID arrays
each time through CreatePartitionPruneState, so that we can mark as
"pruned" (value -1) any partition that is not present in both partdescs.
Having to compare the whole arrays each and every time might not be
great, but I don't see any backpatchable alternative at the moment.
Going forward, we could avoid the hit by having something like a
generation counter for the partitioned table (which is incremented for
each attach and detach), but of course that's not backpatchable.


PS: the pg_advisory_unlock() calls are necessary, because otherwise the
session that first succeeds in the try_lock call retains the lock for
the whole duration of the pgbench script, so the other sessions always
skip the "\if :gotlock" block.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from 
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
 Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg00000.php

Attachments
On 2024-May-22, Tender Wang wrote:

> I have tested this patch locally and did not encounter any failures.
> I will take time to look the patch in detail and consider the issues you
> mentioned.

Thanks for testing!  I've spent some more time playing with this, and I
confirmed my fears that it's possible to hit the case where one
partition is added and another is removed, so the count of partitions is
the same, yet their OIDs don't match (problem [2] in my previous email).
This means we're forced to memcmp() the arrays even when nparts match.
We can reuse the maps if the partition OID arrays are bitwise identical,
otherwise we map the partitions one by one into subpart_map / subplan_map.
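
In other words, the check is roughly this (a self-contained sketch with plain
arrays and hypothetical names; the real code compares partdesc->oids against
pinfo->relid_map):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef unsigned int Oid;

/* Reuse the planner's subplan/subpart maps only when the executor's and the
 * planner's OID arrays are bitwise identical; equal lengths alone prove
 * nothing once attach and detach can happen concurrently. */
static bool
can_reuse_plan_maps(const Oid *exec_oids, int n_exec,
                    const Oid *plan_oids, int n_plan)
{
    return n_exec == n_plan &&
           memcmp(exec_oids, plan_oids, n_exec * sizeof(Oid)) == 0;
}

int
main(void)
{
    /* same length, different content: one partition detached, another attached */
    Oid     plan_oids[] = {10, 20, 30};
    Oid     exec_oids[] = {10, 25, 30};

    printf("reuse maps? %s\n",
           can_reuse_plan_maps(exec_oids, 3, plan_oids, 3) ? "yes" : "no");
    return 0;
}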

I ended up splitting the code fix in two parts, one for
RelationBuildPartitionDesc and the other for CreatePartitionPruneState,
because the two areas being touched have independent bugs that seem more
easily explained separately.  However, I was unable to trigger the bug
that 0001 fixes without involving the code in 0002.  The commit messages
should explain each bug and fix in enough detail, I hope.

I'm not happy about the comment changes in 0002: I think I'm losing some
of the original comments that should perhaps be retained.  I'll look
into that, but I'll push 0001 soon.

(Both these patches are needed from 14 onwards, where DETACH
CONCURRENTLY was added.)

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle."  (Larry Wall, Apocalypse 6)

Attachments


Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Mon, Jun 10, 2024 at 23:33:
On 2024-May-22, Tender Wang wrote:

> I have tested this patch locally and did not encounter any failures.
> I will take the time to look at the patch in detail and consider the issues
> you mentioned.

Thanks for testing!  I've spent some more time playing with this, and I
confirmed my fears that it's possible to hit the case where one
partition is added and another is removed, so the count of partitions is
the same, yet their OIDs don't match (problem [2] in my previous email).
This means we're forced to memcmp() the arrays even when nparts match.
We can reuse the maps if the partition OID arrays are bitwise identical,
otherwise we map the partitions one by one into subpart_map / subplan_map.

I ended up splitting the code fix in two parts, one for
RelationBuildPartitionDesc and the other for CreatePartitionPruneState,
because the two areas being touched have independent bugs that seem more
easily explained separately.  However, I was unable to trigger the bug
that 0001 fixes without involving the code in 0002.  The commit messages
should explain each bug and fix in enough detail, I hope.

I'm OK with 0001 and 0002.  No failures anymore on my machine.
 

I'm not happy about the comment changes in 0002: I think I'm losing some
of the original comments that should perhaps be retained.  I'll look
into that, but I'll push 0001 soon.

(Both these patches are needed from 14 onwards, where DETACH
CONCURRENTLY was added.)

--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle."  (Larry Wall, Apocalypse 6)


--
Tender Wang
OpenPie:  https://en.openpie.com/
On 2024-Jun-10, Alvaro Herrera wrote:

> Thanks for testing!  I've spend some more time playing with this, and I
> confirmed my fears that it's possible to hit the case where one
> partition is added and another is removed, so the count of partitions is
> the same, yet their OIDs don't match (problem [2] in my previous email).
> This means we're forced to memcmp() the arrays even when nparts match.
> We can reuse the maps if the partition OID arrays are bitwise identical,
> otherwise we map the partitions one by one into subpart_map / subplan_map.

... and actually, the code that maps partitions when these arrays don't
match is all wrong, because it assumes that they are in OID order, which
is not true, so I'm busy rewriting it.  More soon.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"After a quick R of TFM, all I can say is HOLY CR** THAT IS COOL! PostgreSQL was
amazing when I first started using it at 7.2, and I'm continually astounded by
learning new features and techniques made available by the continuing work of
the development team."
Berend Tober, http://archives.postgresql.org/pgsql-hackers/2007-08/msg01009.php



On 2024-Jun-11, Alvaro Herrera wrote:

> ... and actually, the code that maps partitions when these arrays don't
> match is all wrong, because it assumes that they are in OID order, which
> is not true, so I'm busy rewriting it.  More soon.

So I came up with the algorithm in the attached patch.  As far as I can
tell, it works okay; I've been trying to produce a test that would
stress it some more, but I noticed a pgbench shortcoming that I've been
trying to solve, unsuccessfully.

Anyway, the idea here is that we match the partdesc->oids entries to the
pinfo->relid_map entries with a distance allowance -- that is, we search
for some OIDs a few elements ahead of the current position.  This allows
us to skip some elements that do not match, without losing sync of the
correct position in the array.  This works because 1) the arrays must be
in the same order, that is, bound order; and 2) the number of elements
that might be missing is bounded by the difference in array lengths.
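
A self-contained illustration of that skip-ahead matching (plain C with
hypothetical names, not the actual patch; the patch walks partdesc->oids and
pinfo->relid_map and derives the allowance from the difference in array
lengths, whereas here the lookahead is simply a parameter):

#include <stdio.h>

typedef unsigned int Oid;

/*
 * Both arrays are in partition-bound order and the OIDs they share appear in
 * the same relative order.  Entries present only in plan_oids were detached
 * after planning; entries present only in exec_oids were attached after
 * planning and are mapped to -1, as if pruned.
 */
static void
match_partitions(const Oid *exec_oids, int n_exec,
                 const Oid *plan_oids, int n_plan,
                 int lookahead, int *plan_idx_of_exec)
{
    int     pd = 0;             /* current position in plan_oids */

    for (int j = 0; j < n_exec; j++)
    {
        int     found = -1;

        /* search at the current position, or a bounded distance ahead of it */
        for (int k = pd; k < n_plan && k <= pd + lookahead; k++)
        {
            if (plan_oids[k] == exec_oids[j])
            {
                found = k;
                break;
            }
        }

        if (found >= 0)
        {
            plan_idx_of_exec[j] = found;    /* skipped entries were detached */
            pd = found + 1;
        }
        else
            plan_idx_of_exec[j] = -1;       /* attached after planning */
    }
}

int
main(void)
{
    /* plan saw 10,20,30,40; 20 was detached and 25 attached concurrently */
    Oid     plan_oids[] = {10, 20, 30, 40};
    Oid     exec_oids[] = {10, 25, 30, 40};
    int     map[4];

    match_partitions(exec_oids, 4, plan_oids, 4, 4, map);
    for (int j = 0; j < 4; j++)
        printf("exec oid %u -> plan index %d\n", exec_oids[j], map[j]);
    return 0;
}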

As I said, I've been hammering it with some modified pgbench scripts;
mainly I did this to set up:

drop table if exists p;
do $$ declare i int; begin for i in 0..99 loop execute format('drop table if exists p%s', i); end loop; end $$;
drop sequence if exists detaches;
create table p (a int, b int) partition by list (a);
set client_min_messages=warning;
do $$
    declare i int;
    declare modulus int;
begin
    for modulus in 0 .. 4 loop
    for i in 0..99 loop
        if i % 5 <> modulus then
            continue;
        end if;
        execute format('create table p%s partition of p for values in (%s)', i, i); end loop;
    end loop;
end $$;
reset client_min_messages;
create sequence detaches;

which ensures the partitions are not in OID order, and then used this
pgbench script

\set part random(0, 89)

select pg_try_advisory_lock(:part)::integer AS gotlock \gset
\if :gotlock

select pg_advisory_lock(142857);
    alter table p detach partition p:part concurrently;
select pg_advisory_unlock(142857);

\set slp random(100, 200)
\sleep :slp us

alter table p attach partition p:part for values in (:part);

select pg_advisory_unlock(:part), nextval('detaches');
\endif


which detaches some partitions randomly, together with the other one

\set id random(0,99)
select * from p where a = :id;

script which reads from the partitioned table.

This setup would fail really quickly with the original code, and with
the patched code it can run a total of some 6700 detach/attach cycles in
60 seconds.  This seems quite slow, and in fact looking at the total
number of partitions in pg_inherits, we have either 99 or 100 partitions
almost the whole time.  That's why I'm trying to modify pgbench ...
I think the problem is that pg_advisory_lock() holds a snapshot which
causes a concurrent detach partition to wait for it, or something like
that.  I added a \lock command, but it doesn't seem to work the way I
want it to.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I'm always right, but sometimes I'm more right than other times."
                                                  (Linus Torvalds)
https://lore.kernel.org/git/Pine.LNX.4.58.0504150753440.7211@ppc970.osdl.org/

Attachments


Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Thu, Jun 13, 2024 at 03:53:
On 2024-Jun-11, Alvaro Herrera wrote:

> ... and actually, the code that maps partitions when these arrays don't
> match is all wrong, because it assumes that they are in OID order, which
> is not true, so I'm busy rewriting it.  More soon.

So I came up with the algorithm in the attached patch.  As far as I can
tell, it works okay; I've been trying to produce a test that would
stress it some more, but I noticed a pgbench shortcoming that I've been
trying to solve, unsuccessfully.

Anyway, the idea here is that we match the partdesc->oids entries to the
pinfo->relid_map entries with a distance allowance -- that is, we search
for some OIDs a few elements ahead of the current position.  This allows
us to skip some elements that do not match, without losing sync of the
correct position in the array.  This works because 1) the arrays must be
in the same order, that is, bound order; and 2) the number of elements
that might be missing is bounded by the difference in array lengths.

Thanks for your nice work.
I have looked through the patch, and it looks OK to me.

As I said, I've been hammering it with some modified pgbench scripts;
mainly I did this to set up:

drop table if exists p;
do $$ declare i int; begin for i in 0..99 loop execute format('drop table if exists p%s', i); end loop; end $$;
drop sequence if exists detaches;
create table p (a int, b int) partition by list (a);
set client_min_messages=warning;
do $$
        declare i int;
        declare modulus int;
begin
        for modulus in 0 .. 4 loop
        for i in 0..99 loop
                if i % 5 <> modulus then
                        continue;
                end if;
                execute format('create table p%s partition of p for values in (%s)', i, i); end loop;
        end loop;
end $$;
reset client_min_messages;
create sequence detaches;

which ensures the partitions are not in OID order, and then used this
pgbench script

\set part random(0, 89)

select pg_try_advisory_lock(:part)::integer AS gotlock \gset
\if :gotlock

select pg_advisory_lock(142857);
    alter table p detach partition p:part concurrently;
select pg_advisory_unlock(142857);

\set slp random(100, 200)
\sleep :slp us

alter table p attach partition p:part for values in (:part);

select pg_advisory_unlock(:part), nextval('detaches');
\endif


which detaches some partitions randomly, together with the other one

\set id random(0,99)
select * from p where a = :id;

script which reads from the partitioned table.

This setup would fail really quickly with the original code, and with
the patched code it can run a total of some 6700 detach/attach cycles in
60 seconds. 
 
Yeah,  nice catch. This will make the code more robust.
 
This seems quite slow, and in fact looking at the total
number of partitions in pg_inherits, we have either 99 or 100 partitions
almost the whole time.  That's why I'm trying to modify pgbench ...
I think the problem is that pg_advisory_lock() holds a snapshot which
causes a concurrent detach partition to wait for it, or something like
that.  I added a \lock command, but it doesn't seem to work the way I
want it to.

--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I'm always right, but sometimes I'm more right than other times."
                                                  (Linus Torvalds)
https://lore.kernel.org/git/Pine.LNX.4.58.0504150753440.7211@ppc970.osdl.org/


--
Tender Wang
OpenPie:  https://en.openpie.com/
I noticed further problems with this patch, and it turned out to have
other bugs that would manifest under very high concurrency of detach and
attach (I think the problem was a detach followed by an immediate attach
of the same partition, or perhaps it was an attach of a neighboring
partition), so I spent more time ironing those out and ended up with the
following, which I'm rather embarrassed not to have had as the previous
version because it looks simpler and more obvious.

Unless further surprises arise soon, I'll push this soonish.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Ellos andaban todos desnudos como su madre los parió, y también las mujeres,
aunque no vi más que una, harto moza, y todos los que yo vi eran todos
mancebos, que ninguno vi de edad de más de XXX años" (Cristóbal Colón)

Attachments


Alvaro Herrera <alvherre@alvh.no-ip.org> wrote on Fri, Jun 21, 2024 at 20:39:
I noticed further problems with this patch, and it turned out to have
other bugs that would manifest under very high concurrency of detach and
attach (I think the problem was a detach followed by an immediate attach
of the same partition, or perhaps it was an attach of a neighboring
partition), so I spent more time ironing those out and ended up with the
following, which I'm rather embarrassed not to have had as the previous
version because it looks simpler and more obvious.

Unless further surprises arise soon, I'll push this soonish.

If we abstract the problem this patch attempts to solve, it is to find the matching elements of two arrays.
The lengths of the two arrays are not guaranteed to be equal, and in this issue, even when the lengths are
the same, we cannot assume that the elements match pairwise.

The algorithm used in the patch looks correct to me, if I understand it correctly.

--
Tender Wang
On 2024-Jun-24, Tender Wang wrote:

> If we abstract the problem this patch attempts to solve, it is to find the
> matching elements of two arrays.
> The lengths of the two arrays are not guaranteed to be equal, and in this
> issue, even when the lengths are the same, we cannot assume that the
> elements match pairwise.
> 
> The algorithm used in the patch looks correct to me, if I understand it correctly.

Thanks for looking again.  I have finally pushed this.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Aprender sin pensar es inútil; pensar sin aprender, peligroso" (Confucio)



Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Thanks for looking again.  I have finally pushed this.

Um ... you are aware that we are in beta2 release freeze, no?

            regards, tom lane



On 2024-Jun-24, Tom Lane wrote:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Thanks for looking again.  I have finally pushed this.
> 
> Um ... you are aware that we are in beta2 release freeze, no?

Sigh.  No, I missed that.  I'm going to revert it now, because I see that
skink failed.  I don't see how my patch is to blame since it's a timeout
and we have no logs for the failure; but maybe my patch did make
something slower and caused some test to go over the time limit.

I'd revert from master only, though, to avoid noise in the other branches,
which aren't in commit freeze and have no BF failures.  Any hard
votes against doing that?

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Postgres is bloatware by design: it was built to house
 PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)



Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Sigh.  No, I missed that.  I'm going to revert it now, because I see that
> skink failed.  I don't see how my patch is to blame since it's a timeout
> and we have no logs for the failure; but maybe my patch did make
> something slower and caused some test to go over the time limit.

skink's been doing that intermittently for some time; I wouldn't
assume this patch is especially to blame.

Given the lack of other failures, maybe you should leave it as-is.
It's not really per policy, but after all this is just a beta ...

            regards, tom lane