RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

Поиск
Список
Период
Сортировка
От Zhijie Hou (Fujitsu)
Тема RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
Дата
Msg-id OS0PR01MB57163F3F1BF6B818ADE34862941AA@OS0PR01MB5716.jpnprd01.prod.outlook.com
обсуждение исходный текст
Ответ на Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()  (Masahiko Sawada <sawada.mshk@gmail.com>)
Список pgsql-bugs
On Thursday, August 17, 2023 6:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> > > >
> > > > The following bug has been logged on the website:
> > > >
> > > > Bug reference:      18055
> > > > Logged by:          ocean li
> > > > Email address:      ocean_li_996@163.com
> > > > PostgreSQL version: 11.9
> > > > Operating system:   centos7 5.10.84 x86_64
> > > > Description:
> > > >
> > > > For testing logical decoding module, *pg_logical_slot_get_changes*
> > > > function is used. Sometimes i got an core whose stack was like that:
> > > > ...
> > > > And in level #3 of stack above, NInitialRunningXacts is 2 and
> > > > InitialRunningXacts is not NULL observed in one of  cores.
> > > >
> > > > Using of NInitialRunningXacts and InitialRunningXacts are clear.
> > > > Currently, the core, as far as i know, maybe caused by this way:
> > > > an ERROR raised when calling
> > > > *pg_logical_slot_get_changes_guts* function. The code part of
> > > > PG_CATCH() doses not reset  NInitialRunningXacts and
> InitialRunningXacts.
> > > > Then, calling pg_logical_slot_get_changes_guts again, the core may
> occur.
> > > > Unfortunately, I couldn't find a minimal reproduction case.
> > > > However, I observed an *ERROR: canceling statement due to
> > > > statement timeout*  logged before each core occurred. (For some
> > > > reason, I can't provide the information of
> > > > log)
> > >
> > > Thanks for the report and the fix!
> > >
> > > I can also reproduce by the steps[1] in PG15~11(Note that we need to
> > > change LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra
> > > running xacts wal records).
> > >
> > > About the patch:
> > >
> > > ---
> > >                 /* free context, call shutdown callback */
> > >                 FreeDecodingContext(ctx);
> > >
> > >                 ReplicationSlotRelease();
> > >                 InvalidateSystemCaches();
> > >         }
> > >         PG_CATCH();
> > >         {
> > > +
> > > +               /* free context, call shutdown callback */
> > > +               FreeDecodingContext(ctx);
> > > +
> > > +               ReplicationSlotRelease();
> > >
> > > I think we could not directly call the cleanup functions here. There
> > > could be two risks:
> > > 1) it's possible either 'ctx' hasn't been initialized before the
> > > error happen, we don't need to clean it up in this case
> > > 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up
> > > normally before entering the catch() block which means we will double
> cleanup(free) it.
> > >
> > > To improve this, I think we can use PG_FINALLY() here and move all
> > > these cleanup functions in it and do the null check for 'ctx' before cleaning
> up.
> > >
> > > Just share one small patch based on your fix for reference, feel free to
> update it.
> > >
> >
> > - LogicalDecodingContext *ctx;
> > + LogicalDecodingContext *ctx = NULL;
> >
> > Don't we need to use volatile for ctx? See the following comment in
> > elog.h: "Note: if a local variable of the function containing PG_TRY
> > is modified in the PG_TRY section and used in the PG_CATCH section,
> > that variable must be declared "volatile" for POSIX compliance."?
> >

Thanks for pointing it out, I think we need to use volatile for it.

> 
> BTW, don't we need a similar change in pg_logical_replication_slot_advance()?

You are right, I can reproduce the same following the steps in [1] and
replace the last two pg_logical_slot_get_changes function calls with two
"SELECT pg_replication_slot_advance('isolation_slot', (select
pg_current_wal_lsn()));"

Here is a new version patch based on PG15 STABLE which addressed above comments.

[1]
https://www.postgresql.org/message-id/OS0PR01MB57160A7DD10B517AFDC73378941AA%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
Следующее
От: Masahiko Sawada
Дата:
Сообщение: Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()