Re: Assertion failure in SnapBuildInitialSnapshot()

From Amit Kapila
Subject Re: Assertion failure in SnapBuildInitialSnapshot()
Date
Msg-id CAA4eK1Lieo9mw0dEB0MEwhrOsZTN7WZ-si4N-DL3bjm9-UmJKQ@mail.gmail.com
In reply to Re: Assertion failure in SnapBuildInitialSnapshot()  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Assertion failure in SnapBuildInitialSnapshot()  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Mon, Jan 30, 2023 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jan 29, 2023 at 9:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sat, Jan 28, 2023 at 11:54 PM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > >
> > > Dear Amit, Sawada-san,
> > >
> > > I have also reproduced the failure on PG15 with some debug log, and I agreed that
> > > somebody changed procArray->replication_slot_xmin to InvalidTransactionId.
> > >
> > > > > The same assertion failure has been reported on another thread[1].
> > > > > Since I could reproduce this issue several times in my environment
> > > > > I've investigated the root cause.
> > > > >
> > > > > I think there is a race condition of updating
> > > > > procArray->replication_slot_xmin by CreateInitDecodingContext() and
> > > > > LogicalConfirmReceivedLocation().
> > > > >
> > > > > What I observed in the test was that a walsender process called:
> > > > > SnapBuildProcessRunningXacts()
> > > > >   LogicalIncreaseXminForSlot()
> > > > >     LogicalConfirmReceivedLocation()
> > > > >       ReplicationSlotsComputeRequiredXmin(false).
> > > > >
> > > > > In ReplicationSlotsComputeRequiredXmin() it acquired the
> > > > > ReplicationSlotControlLock and got 0 as the minimum xmin since there
> > > > > was no wal sender having effective_xmin.
> > > > >
> > > >
> > > > What about the current walsender process which is processing
> > > > running_xacts via SnapBuildProcessRunningXacts()? Isn't that walsender
> > > > slot's effective_xmin have a non-zero value? If not, then why?
> > >
> > > Normal walsenders which are not for tablesync create a replication slot with
> > > NOEXPORT_SNAPSHOT option. I think in this case, CreateInitDecodingContext() is
> > > called with need_full_snapshot = false, and slot->effective_xmin is not updated.
> >
> > Right. This is how we create a slot used by an apply worker.
> >
>
> I was thinking about how that led to this problem because
> GetOldestSafeDecodingTransactionId() ignores InvalidTransactionId.
>

I have reproduced it manually. To do so, I used the debugger to force
the apply worker to call ReplicationSlotsComputeRequiredXmin(false) via
the path
SnapBuildProcessRunningXacts()->LogicalIncreaseXminForSlot()->LogicalConfirmReceivedLocation()
->ReplicationSlotsComputeRequiredXmin(false). The sequence of events is
roughly: (a) the replication_slot_xmin set for the tablesync worker is
overwritten with zero by the apply worker, as explained in Sawada-San's
email; (b) another transaction runs on the publisher, which advances
ShmemVariableCache->nextXid; (c) the tablesync worker invokes
SnapBuildInitialSnapshot()->GetOldestSafeDecodingTransactionId(), which
returns an oldestSafeXid that is higher than the snapshot's xmin.
This happens because replication_slot_xmin is InvalidTransactionId,
replication_slot_catalog_xmin is not considered because the catalogOnly
flag is false, and there is no other running transaction. I think we
should try to come up with a simplified test to reproduce this problem,
if possible.
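
To make steps (a) to (c) concrete, here is a toy model in C. It is only
a sketch and not PostgreSQL source: ToyShared, toy_oldest_safe_xid() and
toy_assertion_holds() are hypothetical names, the xid value 740 is
arbitrary, and xids are compared with plain "<" (no wraparound
handling). It assumes, as described above, that there are no other
running transactions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical stand-in for the shared state discussed in this thread. */
typedef struct ToyShared
{
    TransactionId next_xid;                      /* models nextXid */
    TransactionId replication_slot_xmin;         /* 0 means "not set" */
    TransactionId replication_slot_catalog_xmin; /* 0 means "not set" */
} ToyShared;

/*
 * Simplified model of the oldest-safe-xid computation: start from
 * next_xid and lower it to replication_slot_xmin if that is set, and to
 * replication_slot_catalog_xmin only when catalog_only is true.
 * Running transactions are left out because the scenario has none.
 */
static TransactionId
toy_oldest_safe_xid(const ToyShared *s, bool catalog_only)
{
    TransactionId safe;

    assert(s != NULL);
    safe = s->next_xid;

    if (s->replication_slot_xmin != InvalidTransactionId &&
        s->replication_slot_xmin < safe)
        safe = s->replication_slot_xmin;

    if (catalog_only &&
        s->replication_slot_catalog_xmin != InvalidTransactionId &&
        s->replication_slot_catalog_xmin < safe)
        safe = s->replication_slot_catalog_xmin;

    return safe;
}

/*
 * Replays steps (a)-(c) and reports whether the condition
 * "oldest safe xid <= snapshot xmin" still holds at the end.
 */
static bool
toy_assertion_holds(bool apply_worker_resets_xmin)
{
    ToyShared     s = {740, 740, 740};   /* arbitrary starting values */
    TransactionId snap_xmin = 740;       /* tablesync snapshot's xmin */

    if (apply_worker_resets_xmin)
        s.replication_slot_xmin = InvalidTransactionId;  /* step (a) */

    s.next_xid++;                                        /* step (b) */

    /* step (c): catalogOnly is false, so catalog_xmin is ignored */
    return toy_oldest_safe_xid(&s, false) <= snap_xmin;
}
```

In this model toy_assertion_holds(false) returns true, while
toy_assertion_holds(true) returns false: once replication_slot_xmin is
reset, nothing keeps the result below the advanced next_xid.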

-- 
With Regards,
Amit Kapila.


