RE: Excessive number of replication slots for 12->14 logical replication
От | houzj.fnst@fujitsu.com |
---|---|
Тема | RE: Excessive number of replication slots for 12->14 logical replication |
Дата | |
Msg-id | OS0PR01MB5716E1D75CFDB08E939E521794429@OS0PR01MB5716.jpnprd01.prod.outlook.com обсуждение исходный текст |
Ответ на | RE: Excessive number of replication slots for 12->14 logical replication ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>) |
Ответы |
RE: Excessive number of replication slots for 12->14 logical replication
|
Список | pgsql-bugs |
On Saturday, September 10, 2022 11:41 AM houzj.fnst@fujitsu.com wrote: > > On Saturday, September 10, 2022 5:49 AM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > On Tue, Aug 30, 2022 at 3:44 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > On Fri, Aug 26, 2022 at 7:04 AM Amit Kapila > > > <amit.kapila16@gmail.com> > > wrote: > > > > > > > > Thanks for the testing. I'll push this sometime early next week > > > > (by > > > > Tuesday) unless Sawada-San or someone else has any comments on it. > > > > > > > > > > Pushed. > > > > Tom reported buildfarm failures[1] and I've investigated the cause and > > concluded this commit is relevant. > > > > In process_syncing_tables_for_sync(), we have the following code: > > > > UpdateSubscriptionRelState(MyLogicalRepWorker->subid, > > MyLogicalRepWorker->relid, > > MyLogicalRepWorker->relstate, > > MyLogicalRepWorker->relstate_lsn); > > > > ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid, > > MyLogicalRepWorker->relid, > > originname, > > sizeof(originname)); > > replorigin_session_reset(); > > replorigin_session_origin = InvalidRepOriginId; > > replorigin_session_origin_lsn = InvalidXLogRecPtr; > > replorigin_session_origin_timestamp = 0; > > > > /* > > * We expect that origin must be present. The concurrent operations > > * that remove origin like a refresh for the subscription take an > > * access exclusive lock on pg_subscription which prevent the previou > > * operation to update the rel state to SUBREL_STATE_SYNCDONE to > > * succeed. > > */ > > replorigin_drop_by_name(originname, false, false); > > > > /* > > * End streaming so that LogRepWorkerWalRcvConn can be used to > > drop > > * the slot. > > */ > > walrcv_endstreaming(LogRepWorkerWalRcvConn, &tli); > > > > /* > > * Cleanup the tablesync slot. > > * > > * This has to be done after the data changes because otherwise if > > * there is an error while doing the database operations we won't be > > * able to rollback dropped slot. > > */ > > ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid, > > MyLogicalRepWorker->relid, > > syncslotname, > > sizeof(syncslotname)); > > > > If the table sync worker errored at walrcv_endstreaming(), we assumed > > that both dropping the replication origin and updating relstate are > > rolled back, which however was wrong. Indeed, the replication origin > > is not dropped but the in-memory state is reset. Therefore, after the > > tablesync worker restarts, it starts logical replication with starting > > point 0/0. Consequently, it ends up applying the transaction that has already > been applied. > > Thanks for the analysis ! > > I think you are right. The replorigin_drop_by_name() will clear the > remote_lsn/local_lsn in shared memory which won't be rollback if we fail to drop > the origin. > > Per off-list discussion with Amit. To fix this problem, I think we need to drop the > origin in table sync worker after committing the transaction which set the > relstate to SYNCDONE. Because it can make sure that the worker won’t be > restarted even if we fail to drop the origin. Besides, we need to add the origin > drop code back in apply worker in case the table sync worker failed to drop the > origin before it exits(which seems a rare case). I will share the patch if we agree > with the fix. Here is the draft patch. Share it here so that others can take a look at the basic logic. I will keep testing and improving it. Best regards, Hou zj
Вложения
В списке pgsql-bugs по дате отправления:
Предыдущее
От: Amit KapilaДата:
Сообщение: Re: Excessive number of replication slots for 12->14 logical replication
Следующее
От: "James Pang (chaolpan)"Дата:
Сообщение: RE: huge memory of Postgresql backend process