Re: Assertion failure in SnapBuildInitialSnapshot()
From | Amit Kapila |
---|---|
Subject | Re: Assertion failure in SnapBuildInitialSnapshot() |
Date | |
Msg-id | CAA4eK1Lieo9mw0dEB0MEwhrOsZTN7WZ-si4N-DL3bjm9-UmJKQ@mail.gmail.com |
In response to | Re: Assertion failure in SnapBuildInitialSnapshot() (Amit Kapila <amit.kapila16@gmail.com>) |
Responses | Re: Assertion failure in SnapBuildInitialSnapshot() (Amit Kapila <amit.kapila16@gmail.com>) |
List | pgsql-hackers |
On Mon, Jan 30, 2023 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jan 29, 2023 at 9:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sat, Jan 28, 2023 at 11:54 PM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > >
> > > Dear Amit, Sawada-san,
> > >
> > > I have also reproduced the failure on PG15 with some debug log, and I agreed that
> > > somebody changed procArray->replication_slot_xmin to InvalidTransactionId.
> > >
> > > > > The same assertion failure has been reported on another thread[1].
> > > > > Since I could reproduce this issue several times in my environment
> > > > > I've investigated the root cause.
> > > > >
> > > > > I think there is a race condition of updating
> > > > > procArray->replication_slot_xmin by CreateInitDecodingContext() and
> > > > > LogicalConfirmReceivedLocation().
> > > > >
> > > > > What I observed in the test was that a walsender process called:
> > > > > SnapBuildProcessRunningXacts()
> > > > > LogicalIncreaseXminForSlot()
> > > > > LogicalConfirmReceivedLocation()
> > > > > ReplicationSlotsComputeRequiredXmin(false).
> > > > >
> > > > > In ReplicationSlotsComputeRequiredXmin() it acquired the
> > > > > ReplicationSlotControlLock and got 0 as the minimum xmin since there
> > > > > was no wal sender having effective_xmin.
> > > > >
> > > >
> > > > What about the current walsender process which is processing
> > > > running_xacts via SnapBuildProcessRunningXacts()? Isn't that walsender
> > > > slot's effective_xmin have a non-zero value? If not, then why?
> > >
> > > Normal walsenders which are not for tablesync create a replication slot with
> > > NOEXPORT_SNAPSHOT option. I think in this case, CreateInitDecodingContext() is
> > > called with need_full_snapshot = false, and slot->effective_xmin is not updated.
> >
> > Right. This is how we create a slot used by an apply worker.
> >
>
> I was thinking about how that led to this problem because
> GetOldestSafeDecodingTransactionId() ignores InvalidTransactionId.
>

I have reproduced it manually. For this, I had to use the debugger to
make the call to ReplicationSlotsComputeRequiredXmin(false) happen for
the apply worker via the path
SnapBuildProcessRunningXacts() -> LogicalIncreaseXminForSlot() ->
LogicalConfirmReceivedLocation() -> ReplicationSlotsComputeRequiredXmin(false).

The sequence of events is something like:

(a) the replication_slot_xmin for the tablesync worker is overwritten
    with zero by the apply worker, as explained in Sawada-San's email;
(b) another transaction happens on the publisher, which increases the
    value of ShmemVariableCache->nextXid;
(c) the tablesync worker invokes SnapBuildInitialSnapshot() ->
    GetOldestSafeDecodingTransactionId(), which returns an oldestSafeXid
    that is higher than the snapshot's xmin.

This happens because replication_slot_xmin has an InvalidTransactionId
value, and we won't consider replication_slot_catalog_xmin because the
catalogOnly flag is false and there is no other open running transaction.

I think we should try to get a simplified test to reproduce this
problem if possible.

-- 
With Regards,
Amit Kapila.
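Steps (a) to (c) can be replayed in a small toy model. This is only a sketch in Python, not PostgreSQL code: the function names `oldest_safe_decoding_xid` and `tablesync_initial_snapshot`, the XID values 740 and 741, and the use of plain integer comparison (no wraparound handling) are all assumptions made for illustration. It models only what the message describes: an invalid replication_slot_xmin is ignored, the catalog xmin is skipped when catalogOnly is false, and with no running transactions the result falls back to nextXid.

```python
INVALID_XID = 0  # stands in for InvalidTransactionId


def oldest_safe_decoding_xid(next_xid, slot_xmin, slot_catalog_xmin,
                             catalog_only, running_xids):
    """Toy model of the behaviour described for
    GetOldestSafeDecodingTransactionId(); ignores XID wraparound."""
    oldest = next_xid
    # An invalid replication_slot_xmin is skipped, not treated as a bound.
    if slot_xmin != INVALID_XID and slot_xmin < oldest:
        oldest = slot_xmin
    # The catalog xmin only counts when the caller passes catalogOnly = true.
    if (catalog_only and slot_catalog_xmin != INVALID_XID
            and slot_catalog_xmin < oldest):
        oldest = slot_catalog_xmin
    # Any transaction that is still running holds the bound back as well.
    for xid in running_xids:
        if xid < oldest:
            oldest = xid
    return oldest


def tablesync_initial_snapshot(apply_worker_resets_xmin):
    """Replay steps (a) to (c) with made-up XIDs.

    Returns (safe_xid, snapshot_xmin)."""
    next_xid = 740                  # hypothetical next XID at slot creation
    snapshot_xmin = next_xid        # xmin of the snapshot being built
    slot_xmin = snapshot_xmin       # value set on behalf of the tablesync slot
    slot_catalog_xmin = snapshot_xmin

    if apply_worker_resets_xmin:
        slot_xmin = INVALID_XID     # (a) apply worker overwrites it with zero
    next_xid += 1                   # (b) another transaction on the publisher
    safe_xid = oldest_safe_decoding_xid(   # (c) catalogOnly is false here
        next_xid, slot_xmin, slot_catalog_xmin,
        catalog_only=False, running_xids=[])
    return safe_xid, snapshot_xmin
```

With `apply_worker_resets_xmin=True` the model returns a safe XID that is higher than the snapshot's xmin, which is the condition SnapBuildInitialSnapshot() rejects. With `False`, the slot xmin caps the result and the check holds.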