Re: Assertion failure in SnapBuildInitialSnapshot()

Поиск
Список
Период
Сортировка
От Pradeep Kumar
Тема Re: Assertion failure in SnapBuildInitialSnapshot()
Дата
Msg-id CAJ4xhPmdjViFTGTsLd2VuziSYYVUCfq9v-t-dL-WYT__RpouiQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Assertion failure in SnapBuildInitialSnapshot()  (Masahiko Sawada <sawada.mshk@gmail.com>)
Ответы RE: Assertion failure in SnapBuildInitialSnapshot()
Список pgsql-hackers
I've been investigating the assert failure in ProcArraySetReplicationSlotXmin() and would like to share my approach and get feedback. Instead of inverting the locks and what robert shared before [1].
Instead of unconditionally updating procArray->replication_slot_xmin in ProcArraySetReplicationSlotXmin() in procarray.c, I made the updates conditional:
1) Only update if the incoming xmin is valid
2) Only update if it's older than the currently stored xmin
3) Do the same for procArray->replication_slot_catalog_xmin

void ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin, bool already_locked)
{
       Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));

       if (!already_locked)
           LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

       if (TransactionIdIsValid(xmin))
      {
           if (!TransactionIdIsValid(procArray->replication_slot_xmin) || TransactionIdPrecedes(xmin, procArray->replication_slot_xmin))
                 procArray->replication_slot_xmin = xmin;
      }
      if (TransactionIdIsValid(catalog_xmin))
      {
             if (!TransactionIdIsValid(procArray->replication_slot_catalog_xmin) || TransactionIdPrecedes(catalog_xmin, procArray->replication_slot_catalog_xmin))
                   procArray->replication_slot_catalog_xmin = catalog_xmin;
      }
      if (!already_locked)
          LWLockRelease(ProcArrayLock);

     elog(DEBUG1, "xmin required by slots: data %u, catalog %u", xmin, catalog_xmin);
}

In above block of code ensures we always track the minimum xmin across all active replication slots without losing data. And also no need to worry about locks. And also while reproducing this issue [2] In SnapBuildInitialSnapshot() while we computing safexid by calling GetOldestSafeDecodingTransactionId(false) will enters into first case and update the oldestSafeXid = procArray->replication_slot_xmin. So it won't return nextXid. And also it solves this issue [2].

[1] https://www.postgresql.org/message-id/CA+TgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com

On Fri, Nov 7, 2025 at 11:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Nov 6, 2025 at 8:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Nov 7, 2025 at 8:30 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, November 7, 2025 2:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > On Thu, Nov 6, 2025 at 2:36 AM Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > >
> > > > Good point. This can happen when the last slot is invalidated or dropped.
> > >
> > > After the last slot is invalidated or dropped, both slot_xmin and
> > > slot_catalog_xmin values are set InvalidTransactionId. Then in this
> > > case, these values are ignored when computing the oldest safe decoding
> > > XID in GetOldestSafeDecodingTransactionId(), no? Or do you mean that
> > > there is a case where slot_xmin and slot_catalog_xmin retreat to a
> > > valid XID?
> >
> > I think when replication_slot_xmin is invalid,
> > GetOldestSafeDecodingTransactionId would return nextXid, which can be greater
> > than the original snap.xmin if some transaction IDs have been assigned.
> >
>
> Won't we have a problem that values of
> procArray->replication_slot_xmin and
> procArray->replication_slot_catalog_xmin won't be set to
> InvalidTransactionId after last slot removal due to a new check unless
> we do special treatment for drop/invalidation of a slot? And that
> would lead to accumulating dead rows even when not required.

I understand Hou-san's point. Agreed. procArray->replication_slot_xmin
and replication_slot_catalog_xmin should not retreat to a valid XID
but could become 0 (invalid). Let's consider the idea of inverting the
locks as Andres proposed[1].

Regards,

[1] https://www.postgresql.org/message-id/20230207194903.ws4acm7ake6ikacn%40awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

В списке pgsql-hackers по дате отправления: