Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

Поиск
Список
Период
Сортировка
От Craig Ringer
Тема Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers
Дата
Msg-id CAMsr+YFw4arf=LtaYgVL47J9y_t5uNkPpw1zS8tcbLn6REEJ9A@mail.gmail.com
обсуждение исходный текст
Ответ на Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers  (Stephen Frost <sfrost@snowman.net>)
Список pgsql-hackers
On 5 January 2018 at 12:16, Stephen Frost <sfrost@snowman.net> wrote:
Greetings all,

* Masahiko Sawada (sawada.mshk@gmail.com) wrote:
> On Tue, Dec 26, 2017 at 10:03 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
> > On 26/12/17 11:13, Masahiko Sawada wrote:
> >> On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek
> >> <petr.jelinek@2ndquadrant.com> wrote:
> >>
> >>>>
> >>>> It's not a problem on crash restart because StartupReorderBuffer already
> >>>> does the required delete.
> >>>>
> >>>> ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear
> >>>> to have any guard to ensure that the segment files don't already exist
> >>>> when it goes to serialize a snapshot. Adding one there would probably be
> >>>> expensive; we don't know the last lsn of the txn yet, so to be really
> >>>> safe we'd have to do a directory listing and scan for any txn-$OURXID-*
> >>>> entries.
> >>>>
> >>>> So to fix, I suggest that we should do a
> >>>> slot-specific StartupReorderBuffer-style deletion when we start a new
> >>>> decoding session on a slot, per attached patch.
> >>>>
> >>>> It might be nice to also add a hook on proc exit, so we don't have stale
> >>>> buffers lying around until next decoding session, but I didn't want to
> >>>> add new global state to reorderbuffer.c so I've left that for now.
> >>>
> >>>
> >>> Hmm, can't we simply call the new cleanup function in
> >>> ReplicationSlotRelease()? That's called during process exit and we know
> >>> there about slot so no extra global variables are needed.
> >>>
> >>
> >> I guess that ReplicationSlotRelease() currently might not get called
> >> if walsender exits by proc_exit(). ReplicationSlotRelease() can is
> >> called by some functions such as WalSndErrorCleanup(), but at least in
> >> the case where wal sender exits due to failed to write data to socket,
> >> ReplicationSlotRelease() didn't get called as far as I tested.
> >>
> >
> > Are you sure about that testing? Did you test it with replication slot
> > active? ReplicationSlotRelease() is called from ProcKill() if the
> > process is using a slot and should be called for any kind of exit except
> > for outright crash (the kind that makes postgres kill all backends). If
> > it wasn't we'd never unlock the replication slot used by the exiting
> > walsender.
>
> Ah, sorry I was wrong. WalSndErrorCleanup() didn't get called but
> ReplicationSlotRelease() got called. I agree that cleanup function
> gets called in ReplicationSlotRelease().

This patch is currently in 'Waiting on Author' state, but looks like it
should actually be in Needs Review (or perhaps Ready for Commmitter)..?

Craig, would you agree with that and, if so, please update the status
accordingly.


WoA looks correct, Petr suggested a tweak to how and when the cleanup is done that I need to adopt.

I'm just about out of time before a trip, but I'll get a colleague to update it if I don't get the chance, so we can get it in and backpatched for this CF. 

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Condition variable live lock
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: Condition variable live lock