RE: Excessive number of replication slots for 12->14 logical replication

From: houzj.fnst@fujitsu.com
Subject: RE: Excessive number of replication slots for 12->14 logical replication
Date:
Msg-id: OS0PR01MB5716E0913996BD51113403A094459@OS0PR01MB5716.jpnprd01.prod.outlook.com
In reply to: RE: Excessive number of replication slots for 12->14 logical replication  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
Responses: Re: Excessive number of replication slots for 12->14 logical replication  (Masahiko Sawada <sawada.mshk@gmail.com>)
List: pgsql-bugs
On Saturday, September 10, 2022 6:34 PM houzj.fnst@fujitsu.com wrote:
> 
> On Saturday, September 10, 2022 11:41 AM houzj.fnst@fujitsu.com wrote:
> >
> > On Saturday, September 10, 2022 5:49 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Aug 30, 2022 at 3:44 PM Amit Kapila
> > > <amit.kapila16@gmail.com>
> > wrote:
> > > >
> > > > On Fri, Aug 26, 2022 at 7:04 AM Amit Kapila
> > > > <amit.kapila16@gmail.com>
> > > wrote:
> > > > >
> > > > > Thanks for the testing. I'll push this sometime early next week
> > > > > (by
> > > > > Tuesday) unless Sawada-San or someone else has any comments on it.
> > > > >
> > > >
> > > > Pushed.
> > >
> > > Tom reported buildfarm failures[1] and I've investigated the cause
> > > and concluded this commit is relevant.
> > >
> > >
> > > If the table sync worker errored out at walrcv_endstreaming(), we
> > > assumed that both dropping the replication origin and updating
> > > relstate would be rolled back, but that was wrong. Indeed, the
> > > replication origin is not dropped, but the in-memory state is reset.
> > > Therefore, after the tablesync worker restarts, it starts logical
> > > replication with starting point 0/0 and consequently ends up
> > > applying a transaction that has already been applied.
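
To make the non-transactional nature of the in-memory origin state easier to
see, here is a rough psql sketch (not from the original report; the origin name
and LSN below are made-up values, and the exact result depends on the server
version):

-- run as superuser; 'demo_origin' and '0/1000' are arbitrary illustrative values
SELECT pg_replication_origin_create('demo_origin');
SELECT pg_replication_origin_advance('demo_origin', '0/1000');

BEGIN;
SELECT pg_replication_origin_drop('demo_origin');
ROLLBACK;   -- the catalog entry survives the rollback ...

-- ... but on affected versions the remembered progress in shared memory may
-- already have been cleared, i.e. remote_lsn can read back as 0/0 here
SELECT external_id, remote_lsn FROM pg_replication_origin_status;
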
> >
> > Thanks for the analysis!
> >
> > I think you are right. replorigin_drop_by_name() clears the
> > remote_lsn/local_lsn in shared memory, and that is not rolled back if
> > we fail to drop the origin.
> >
> > Per off-list discussion with Amit, I think that to fix this problem we
> > need to drop the origin in the table sync worker after committing the
> > transaction that sets the relstate to SYNCDONE, because that ensures
> > the worker won't be restarted even if we then fail to drop the origin.
> > Besides, we need to add the origin drop code back to the apply worker
> > in case the table sync worker fails to drop the origin before it exits
> > (which seems a rare case). I will share a patch if we agree on this fix.
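
(As a side note, not part of the fix itself: the relation sync state and any
leftover tablesync origin can be observed from SQL. The 'pg_<suboid>_<reloid>'
naming assumed below is the current tablesync origin naming convention and an
internal detail, so treat this purely as a diagnostic sketch.)

-- per-relation sync state of a subscription: 's' = SYNCDONE, 'r' = READY
SELECT srsubid, srrelid::regclass, srsubstate
FROM pg_subscription_rel;

-- tablesync origins are named like 'pg_<subscription oid>_<table oid>';
-- one left behind after SYNCDONE is what the apply worker would now clean up
SELECT external_id, remote_lsn
FROM pg_replication_origin_status
WHERE external_id LIKE 'pg\_%\_%';
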
> 
> Here is the draft patch. I am sharing it here so that others can take a look at the
> basic logic. I will keep testing and improving it.

I tried to reproduce the reported problem as follows: a) use gdb to attach to the
table sync worker and block it; b) from another session, begin a transaction on the
publisher that writes some data (INSERT 1111111) into the table, and commit it;
c) release the table sync worker and let it apply the changes; d) stop the table
sync worker after it has dropped the origin but before it drops the slot, and force
the code into an error path so that the table sync worker errors out and restarts.
After that I see an error showing that the same data is applied twice.

2022-09-10 19:27:51.205 CST [2830699] ERROR:  duplicate key value violates unique constraint "test_tab_pkey"
2022-09-10 19:27:51.205 CST [2830699] DETAIL:  Key (a)=(1111111) already exists.
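
For reference, a minimal sketch of this kind of setup (the publication/subscription
names and the connection string are illustrative; only test_tab and the inserted key
come from the log above):

-- publisher
CREATE TABLE test_tab (a int PRIMARY KEY);
CREATE PUBLICATION test_pub FOR TABLE test_tab;

-- subscriber
CREATE TABLE test_tab (a int PRIMARY KEY);
CREATE SUBSCRIPTION test_sub
    CONNECTION 'host=localhost port=5432 dbname=postgres'
    PUBLICATION test_pub;

-- publisher, while the table sync worker is blocked in gdb (step b above)
INSERT INTO test_tab VALUES (1111111);
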

With the same steps, everything works fine after applying the patch.

Attached is the patch again, with some cosmetic changes.

Best regards,
Hou zj




