RE: Excessive number of replication slots for 12->14 logical replication

From houzj.fnst@fujitsu.com
Subject RE: Excessive number of replication slots for 12->14 logical replication
Date
Msg-id OS0PR01MB5716E1D75CFDB08E939E521794429@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to RE: Excessive number of replication slots for 12->14 logical replication  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
Responses RE: Excessive number of replication slots for 12->14 logical replication  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
List pgsql-bugs
On Saturday, September 10, 2022 11:41 AM houzj.fnst@fujitsu.com wrote:
> 
> On Saturday, September 10, 2022 5:49 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Aug 30, 2022 at 3:44 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Fri, Aug 26, 2022 at 7:04 AM Amit Kapila
> > > <amit.kapila16@gmail.com>
> > wrote:
> > > >
> > > > Thanks for the testing. I'll push this sometime early next week
> > > > (by
> > > > Tuesday) unless Sawada-San or someone else has any comments on it.
> > > >
> > >
> > > Pushed.
> >
> > Tom reported buildfarm failures[1] and I've investigated the cause and
> > concluded this commit is relevant.
> >
> > In process_syncing_tables_for_sync(), we have the following code:
> >
> >         UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
> >                                    MyLogicalRepWorker->relid,
> >                                    MyLogicalRepWorker->relstate,
> >                                    MyLogicalRepWorker->relstate_lsn);
> >
> >         ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
> >                                           MyLogicalRepWorker->relid,
> >                                           originname,
> >                                           sizeof(originname));
> >         replorigin_session_reset();
> >         replorigin_session_origin = InvalidRepOriginId;
> >         replorigin_session_origin_lsn = InvalidXLogRecPtr;
> >         replorigin_session_origin_timestamp = 0;
> >
> >         /*
> >          * We expect that origin must be present. The concurrent operations
> >          * that remove origin like a refresh for the subscription take an
> >          * access exclusive lock on pg_subscription which prevent the previou
> >          * operation to update the rel state to SUBREL_STATE_SYNCDONE to
> >          * succeed.
> >          */
> >         replorigin_drop_by_name(originname, false, false);
> >
> >         /*
> >          * End streaming so that LogRepWorkerWalRcvConn can be used to
> > drop
> >          * the slot.
> >          */
> >         walrcv_endstreaming(LogRepWorkerWalRcvConn, &tli);
> >
> >         /*
> >          * Cleanup the tablesync slot.
> >          *
> > >          * This has to be done after the data changes because otherwise, if
> > >          * there is an error while doing the database operations, we won't
> > >          * be able to roll back the dropped slot.
> >          */
> >         ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
> >                                         MyLogicalRepWorker->relid,
> >                                         syncslotname,
> >                                         sizeof(syncslotname));
> >
> > If the table sync worker errored at walrcv_endstreaming(), we assumed
> > that both the replication origin drop and the relstate update would be
> > rolled back, which however was wrong. Indeed, the replication origin
> > is not dropped, but its in-memory state is reset. Therefore, after the
> > tablesync worker restarts, it starts logical replication from starting
> > point 0/0. Consequently, it ends up applying transactions that have
> > already been applied.
> 
> Thanks for the analysis!
> 
> I think you are right. replorigin_drop_by_name() clears the
> remote_lsn/local_lsn in shared memory, and that is not rolled back if we
> fail to drop the origin.
> 
> Per an off-list discussion with Amit, to fix this problem we need to drop
> the origin in the table sync worker after committing the transaction that
> sets the relstate to SYNCDONE. That ensures the worker won't be restarted
> even if we fail to drop the origin. Besides, we need to add the origin-drop
> code back in the apply worker, in case the table sync worker failed to drop
> the origin before it exits (which seems a rare case). I will share the
> patch if we agree on the fix.

Here is a draft patch. I am sharing it here so that others can take a look at the basic logic; I will keep testing and improving it.

Best regards,
Hou zj




Attachments

In the pgsql-bugs list, by delivery date:

Previous
From: Amit Kapila
Date:
Message: Re: Excessive number of replication slots for 12->14 logical replication
Next
From: "James Pang (chaolpan)"
Date:
Message: RE: huge memory of Postgresql backend process