Re: Excessive number of replication slots for 12->14 logical replication
From: Masahiko Sawada
Subject: Re: Excessive number of replication slots for 12->14 logical replication
Date:
Msg-id: CAD21AoC6dpzh3LzsiV1RzYR8WytO8TtkdGnLUa_dqZCGC=n_AA@mail.gmail.com
In reply to: RE: Excessive number of replication slots for 12->14 logical replication ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
Responses: Re: Excessive number of replication slots for 12->14 logical replication
List: pgsql-bugs
On Sun, Sep 11, 2022 at 11:55 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Saturday, September 10, 2022 6:34 PM houzj.fnst@fujitsu.com wrote:
> >
> > On Saturday, September 10, 2022 11:41 AM houzj.fnst@fujitsu.com wrote:
> > >
> > > On Saturday, September 10, 2022 5:49 AM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Tue, Aug 30, 2022 at 3:44 PM Amit Kapila
> > > > <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Fri, Aug 26, 2022 at 7:04 AM Amit Kapila
> > > > > <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > > Thanks for the testing. I'll push this sometime early next week
> > > > > > (by Tuesday) unless Sawada-San or someone else has any comments on it.
> > > > >
> > > > > Pushed.
> > > >
> > > > Tom reported buildfarm failures[1] and I've investigated the cause
> > > > and concluded this commit is relevant.
> > > >
> > > > If the table sync worker errored at walrcv_endstreaming(), we
> > > > assumed that both dropping the replication origin and updating
> > > > relstate are rolled back, which however was wrong. Indeed, the
> > > > replication origin is not dropped, but the in-memory state is reset.
> > > > Therefore, after the tablesync worker restarts, it starts logical
> > > > replication with starting point 0/0. Consequently, it ends up
> > > > applying the transaction that has already been applied.
> > >
> > > Thanks for the analysis!
> > >
> > > I think you are right. The replorigin_drop_by_name() will clear the
> > > remote_lsn/local_lsn in shared memory, which won't be rolled back if we
> > > fail to drop the origin.
> > >
> > > Per off-list discussion with Amit, to fix this problem I think we
> > > need to drop the origin in the table sync worker after committing the
> > > transaction which sets the relstate to SYNCDONE, because that makes
> > > sure the worker won't be restarted even if we fail to drop the
> > > origin.
> > > Besides, we need to add the origin drop code back in the apply
> > > worker in case the table sync worker failed to drop the origin before
> > > it exits (which seems a rare case). I will share the patch if we agree
> > > with the fix.
> >
> > Here is the draft patch. I share it here so that others can take a look
> > at the basic logic. I will keep testing and improving it.
>
> I tried to reproduce the reported problem by a) using gdb to attach to
> the table sync worker and block it; b) starting another session to begin
> a transaction that writes some data (INSERT 1111111) to the publisher
> table and commits it; c) releasing the table sync worker and letting it
> apply the changes; d) stopping the table sync worker after dropping the
> origin and before dropping the slot, and jumping the code into an error
> path so that the table sync worker errors out and restarts. Then I see
> an error which shows that the same data is applied twice.
>
> 2022-09-10 19:27:51.205 CST [2830699] ERROR:  duplicate key value violates unique constraint "test_tab_pkey"
> 2022-09-10 19:27:51.205 CST [2830699] DETAIL:  Key (a)=(1111111) already exists.
>
> And with the same steps it works fine after applying the patch.
>
> Attached the patch again with some cosmetic changes.

Thank you for the patch! I agree with the approach of the patch to fix
this issue, and I've confirmed the issue doesn't happen with this patch.

Here are some review comments:

         /*
          * UpdateSubscriptionRelState must be called within a transaction.
-         * That transaction will be ended within the finish_sync_worker().
          */

I think we can move the removed sentence to where we added
StartTransactionCommand(). For example:

         * Start a new transaction to cleanup the tablesync origin tracking.
         * That transaction will be ended within the finish_sync_worker().
---

         * This has to be done after the data changes because otherwise if

I think we can change this sentence back to the one we had before:

         * This has to be done after updating the state because otherwise if

---

+       CommitTransactionCommand();
+

We need to do pgstat_report_stat() since we performed DML.

---

+       /*
+        * Start a new transaction to cleanup the tablesync origin tracking.
+        *
+        * We need to do this after the table state is set to SYNCDONE,
+        * otherwise if an error occurs while performing the database
+        * operation, the worker will be restarted, but the in-memory
+        * replication progress (remote_lsn) has been cleaned and will not
+        * be rolled back, so the restarted worker will use invalid
+        * replication progress resulting in replay of transactions that
+        * have already been applied.
+        */

How about mentioning that even if the tablesync worker failed to drop
the replication origin, the tablesync worker won't restart, but the
apply worker will drop it instead?

Regards,

--
Masahiko Sawada
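The failure mode discussed in this thread hinges on two kinds of state with different durability: the catalog relstate update is transactional and rolls back on error, while the replication-origin progress cleared by the origin drop lives in shared memory and is not restored by a rollback. The toy simulation below sketches that interaction; it is not PostgreSQL code, and all names in it are hypothetical stand-ins for the real structures.

```python
# Toy simulation of the bug: 'relstate' models the transactional catalog
# entry (pg_subscription_rel), 'remote_lsn' models the non-transactional
# in-memory origin progress. All names here are hypothetical, not PG APIs.

class TablesyncWorker:
    def __init__(self):
        self.relstate = "CATCHUP"      # transactional (catalog)
        self.remote_lsn = "0/15E6A10"  # non-transactional (shared memory)

def buggy_order(w, fail_after_drop=True):
    """Set SYNCDONE and drop the origin in the SAME transaction."""
    saved = w.relstate
    w.relstate = "SYNCDONE"
    w.remote_lsn = "0/0"        # origin drop clears in-memory progress
    if fail_after_drop:         # e.g. error at walrcv_endstreaming()
        w.relstate = saved      # the catalog update rolls back ...
        raise RuntimeError("tx aborted")  # ... but remote_lsn does not

def fixed_order(w):
    """Commit SYNCDONE first; drop the origin in a separate transaction."""
    w.relstate = "SYNCDONE"     # committed: the worker won't restart now
    w.remote_lsn = "0/0"        # a failure past this point is harmless

w = TablesyncWorker()
try:
    buggy_order(w)
except RuntimeError:
    pass
# The restarted worker sees CATCHUP again but an invalid start point,
# so it would replay already-applied changes (the duplicate-key error).
print(w.relstate, w.remote_lsn)  # CATCHUP 0/0
```

With the fixed ordering, a failure while dropping the origin leaves a stale origin behind (to be cleaned up by the apply worker) but can no longer cause the tablesync worker to restart with bogus progress.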