Обсуждение: BUGFIX: standby disconnect can corrupt serialized reorder buffers
Hi all
It's not a problem on crash restart because StartupReorderBuffer already does the required delete.
--
TL;DR: we should delete pg_replslot/$SLOTNAME/* at the start of each decoding session or we can accidentally re-use stale reorder buffer contents from prior runs and $BADNESS happens. StartupReorderBuffer() is not sufficient.
Details:
Petr and I have found a bug in logical decoding where changes get appended multiple times to serialized reorder buffers. This causes duplicate changes sent to downstream (conflicts, ERRORs, etc).
Symptoms are:
* Oversized serialized reorder buffers in pg_replslot. In the case we found this problem in, there was a 147GB reorder buffer that should only have been 12GB.
* Downstream/receiver working hard but achieving nothing (pglogical/bdr with conflict resolution enabled), or failing with unique violations and other errors (built-in logical replication)
Debugging showed that logical decoding was calling the output plugin many times with the same set of ReorderBufferChange records. It'd advance normally, then go back to the start of a page address or similar and go through them all all over again.
Log analysis showed that the downstream had been repeatedly connecting to the upstream, then ERRORing out, so the upstream logs were full of
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on standby connection
When the downstream error was fixed, the repeated changes were seen.
The cause appears to be that walsender.c's ProcessRepliesIfAny writes a LOG for unexpected EOF then calls proc_exit(0). But serialized txn cleanup is done by
ReorderBufferRestoreCleanup, as called by ReorderBufferCleanupTXN, which is invoked from the PG_CATCH() in ReorderBufferCommit() and on various normal exits. It won't get called if we proc_exit() without an ERROR, so we leave stale data lying around.It's not a problem on crash restart because StartupReorderBuffer already does the required delete.
ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear to have any guard to ensure that the segment files don't already exist when it goes to serialize a snapshot. Adding one there would probably be expensive; we don't know the last lsn of the txn yet, so to be really safe we'd have to do a directory listing and scan for any txn-$OURXID-* entries.
So to fix, I suggest that we should do a slot-specific StartupReorderBuffer-style deletion when we start a new decoding session on a slot, per attached patch.
It might be nice to also add a hook on proc exit, so we don't have stale buffers lying around until next decoding session, but I didn't want to add new global state to reorderbuffer.c so I've left that for now.
BTW, while this bug looks similar to https://www.postgresql.org/message-id/54e4e488-186b-a056-6628-50628e4e4ebc@lab.ntt.co.jp "Failed to delete old ReorderBuffer spilled files" it's really quite a separate issue.
Both this bugfix and the above need backpatching to 9.4.
Вложения
Hi, thanks for writing the patch. On 05/12/17 06:58, Craig Ringer wrote: > Hi all > > [...] >> The cause appears to be that walsender.c's ProcessRepliesIfAny writes a > LOG for unexpected EOF then calls proc_exit(0). But serialized txn > cleanup is done by > ReorderBufferRestoreCleanup, as called by ReorderBufferCleanupTXN, which > is invoked from the PG_CATCH() in ReorderBufferCommit() and on various > normal exits. It won't get called if we proc_exit() without an ERROR, so > we leave stale data lying around. > > It's not a problem on crash restart because StartupReorderBuffer already > does the required delete. > > ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear > to have any guard to ensure that the segment files don't already exist > when it goes to serialize a snapshot. Adding one there would probably be > expensive; we don't know the last lsn of the txn yet, so to be really > safe we'd have to do a directory listing and scan for any txn-$OURXID-* > entries. > > So to fix, I suggest that we should do a > slot-specific StartupReorderBuffer-style deletion when we start a new > decoding session on a slot, per attached patch. > > It might be nice to also add a hook on proc exit, so we don't have stale > buffers lying around until next decoding session, but I didn't want to > add new global state to reorderbuffer.c so I've left that for now. Hmm, can't we simply call the new cleanup function in ReplicationSlotRelease()? That's called during process exit and we know there about slot so no extra global variables are needed. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > Hi, > > thanks for writing the patch. > > On 05/12/17 06:58, Craig Ringer wrote: >> Hi all >> >> [...] >>> The cause appears to be that walsender.c's ProcessRepliesIfAny writes a >> LOG for unexpected EOF then calls proc_exit(0). But serialized txn >> cleanup is done by >> ReorderBufferRestoreCleanup, as called by ReorderBufferCleanupTXN, which >> is invoked from the PG_CATCH() in ReorderBufferCommit() and on various >> normal exits. It won't get called if we proc_exit() without an ERROR, so >> we leave stale data lying around. Thank you for the details. I could reproduce this bug. This bug also happens if pq_flush_if_writable called by WalSndWriteData could not write data (for example, the case where replicated data violate unique constraint on the subscriber). >> >> It's not a problem on crash restart because StartupReorderBuffer already >> does the required delete. >> >> ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear >> to have any guard to ensure that the segment files don't already exist >> when it goes to serialize a snapshot. Adding one there would probably be >> expensive; we don't know the last lsn of the txn yet, so to be really >> safe we'd have to do a directory listing and scan for any txn-$OURXID-* >> entries. >> >> So to fix, I suggest that we should do a >> slot-specific StartupReorderBuffer-style deletion when we start a new >> decoding session on a slot, per attached patch. >> >> It might be nice to also add a hook on proc exit, so we don't have stale >> buffers lying around until next decoding session, but I didn't want to >> add new global state to reorderbuffer.c so I've left that for now. > > > Hmm, can't we simply call the new cleanup function in > ReplicationSlotRelease()? That's called during process exit and we know > there about slot so no extra global variables are needed. > I guess that ReplicationSlotRelease() currently might not get called if walsender exits by proc_exit(). ReplicationSlotRelease() can is called by some functions such as WalSndErrorCleanup(), but at least in the case where wal sender exits due to failed to write data to socket, ReplicationSlotRelease() didn't get called as far as I tested. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 26/12/17 11:13, Masahiko Sawada wrote: > On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek > <petr.jelinek@2ndquadrant.com> wrote: > >>> >>> It's not a problem on crash restart because StartupReorderBuffer already >>> does the required delete. >>> >>> ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear >>> to have any guard to ensure that the segment files don't already exist >>> when it goes to serialize a snapshot. Adding one there would probably be >>> expensive; we don't know the last lsn of the txn yet, so to be really >>> safe we'd have to do a directory listing and scan for any txn-$OURXID-* >>> entries. >>> >>> So to fix, I suggest that we should do a >>> slot-specific StartupReorderBuffer-style deletion when we start a new >>> decoding session on a slot, per attached patch. >>> >>> It might be nice to also add a hook on proc exit, so we don't have stale >>> buffers lying around until next decoding session, but I didn't want to >>> add new global state to reorderbuffer.c so I've left that for now. >> >> >> Hmm, can't we simply call the new cleanup function in >> ReplicationSlotRelease()? That's called during process exit and we know >> there about slot so no extra global variables are needed. >> > > I guess that ReplicationSlotRelease() currently might not get called > if walsender exits by proc_exit(). ReplicationSlotRelease() can is > called by some functions such as WalSndErrorCleanup(), but at least in > the case where wal sender exits due to failed to write data to socket, > ReplicationSlotRelease() didn't get called as far as I tested. > Are you sure about that testing? Did you test it with replication slot active? ReplicationSlotRelease() is called from ProcKill() if the process is using a slot and should be called for any kind of exit except for outright crash (the kind that makes postgres kill all backends). If it wasn't we'd never unlock the replication slot used by the exiting walsender. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 26, 2017 at 10:03 PM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > On 26/12/17 11:13, Masahiko Sawada wrote: >> On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek >> <petr.jelinek@2ndquadrant.com> wrote: >> >>>> >>>> It's not a problem on crash restart because StartupReorderBuffer already >>>> does the required delete. >>>> >>>> ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear >>>> to have any guard to ensure that the segment files don't already exist >>>> when it goes to serialize a snapshot. Adding one there would probably be >>>> expensive; we don't know the last lsn of the txn yet, so to be really >>>> safe we'd have to do a directory listing and scan for any txn-$OURXID-* >>>> entries. >>>> >>>> So to fix, I suggest that we should do a >>>> slot-specific StartupReorderBuffer-style deletion when we start a new >>>> decoding session on a slot, per attached patch. >>>> >>>> It might be nice to also add a hook on proc exit, so we don't have stale >>>> buffers lying around until next decoding session, but I didn't want to >>>> add new global state to reorderbuffer.c so I've left that for now. >>> >>> >>> Hmm, can't we simply call the new cleanup function in >>> ReplicationSlotRelease()? That's called during process exit and we know >>> there about slot so no extra global variables are needed. >>> >> >> I guess that ReplicationSlotRelease() currently might not get called >> if walsender exits by proc_exit(). ReplicationSlotRelease() can is >> called by some functions such as WalSndErrorCleanup(), but at least in >> the case where wal sender exits due to failed to write data to socket, >> ReplicationSlotRelease() didn't get called as far as I tested. >> > > Are you sure about that testing? Did you test it with replication slot > active? ReplicationSlotRelease() is called from ProcKill() if the > process is using a slot and should be called for any kind of exit except > for outright crash (the kind that makes postgres kill all backends). If > it wasn't we'd never unlock the replication slot used by the exiting > walsender. > Ah, sorry I was wrong. WalSndErrorCleanup() didn't get called but ReplicationSlotRelease() got called. I agree that cleanup function gets called in ReplicationSlotRelease(). Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Greetings all, * Masahiko Sawada (sawada.mshk@gmail.com) wrote: > On Tue, Dec 26, 2017 at 10:03 PM, Petr Jelinek > <petr.jelinek@2ndquadrant.com> wrote: > > On 26/12/17 11:13, Masahiko Sawada wrote: > >> On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek > >> <petr.jelinek@2ndquadrant.com> wrote: > >> > >>>> > >>>> It's not a problem on crash restart because StartupReorderBuffer already > >>>> does the required delete. > >>>> > >>>> ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear > >>>> to have any guard to ensure that the segment files don't already exist > >>>> when it goes to serialize a snapshot. Adding one there would probably be > >>>> expensive; we don't know the last lsn of the txn yet, so to be really > >>>> safe we'd have to do a directory listing and scan for any txn-$OURXID-* > >>>> entries. > >>>> > >>>> So to fix, I suggest that we should do a > >>>> slot-specific StartupReorderBuffer-style deletion when we start a new > >>>> decoding session on a slot, per attached patch. > >>>> > >>>> It might be nice to also add a hook on proc exit, so we don't have stale > >>>> buffers lying around until next decoding session, but I didn't want to > >>>> add new global state to reorderbuffer.c so I've left that for now. > >>> > >>> > >>> Hmm, can't we simply call the new cleanup function in > >>> ReplicationSlotRelease()? That's called during process exit and we know > >>> there about slot so no extra global variables are needed. > >>> > >> > >> I guess that ReplicationSlotRelease() currently might not get called > >> if walsender exits by proc_exit(). ReplicationSlotRelease() can is > >> called by some functions such as WalSndErrorCleanup(), but at least in > >> the case where wal sender exits due to failed to write data to socket, > >> ReplicationSlotRelease() didn't get called as far as I tested. > >> > > > > Are you sure about that testing? Did you test it with replication slot > > active? ReplicationSlotRelease() is called from ProcKill() if the > > process is using a slot and should be called for any kind of exit except > > for outright crash (the kind that makes postgres kill all backends). If > > it wasn't we'd never unlock the replication slot used by the exiting > > walsender. > > Ah, sorry I was wrong. WalSndErrorCleanup() didn't get called but > ReplicationSlotRelease() got called. I agree that cleanup function > gets called in ReplicationSlotRelease(). This patch is currently in 'Waiting on Author' state, but looks like it should actually be in Needs Review (or perhaps Ready for Commmitter)..? Craig, would you agree with that and, if so, please update the status accordingly. Thanks! Stephen
Вложения
On 5 January 2018 at 12:16, Stephen Frost <sfrost@snowman.net> wrote:
Greetings all,This patch is currently in 'Waiting on Author' state, but looks like it
* Masahiko Sawada (sawada.mshk@gmail.com) wrote:
> On Tue, Dec 26, 2017 at 10:03 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
> > On 26/12/17 11:13, Masahiko Sawada wrote:
> >> On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek
> >> <petr.jelinek@2ndquadrant.com> wrote:
> >>
> >>>>
> >>>> It's not a problem on crash restart because StartupReorderBuffer already
> >>>> does the required delete.
> >>>>
> >>>> ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear
> >>>> to have any guard to ensure that the segment files don't already exist
> >>>> when it goes to serialize a snapshot. Adding one there would probably be
> >>>> expensive; we don't know the last lsn of the txn yet, so to be really
> >>>> safe we'd have to do a directory listing and scan for any txn-$OURXID-*
> >>>> entries.
> >>>>
> >>>> So to fix, I suggest that we should do a
> >>>> slot-specific StartupReorderBuffer-style deletion when we start a new
> >>>> decoding session on a slot, per attached patch.
> >>>>
> >>>> It might be nice to also add a hook on proc exit, so we don't have stale
> >>>> buffers lying around until next decoding session, but I didn't want to
> >>>> add new global state to reorderbuffer.c so I've left that for now.
> >>>
> >>>
> >>> Hmm, can't we simply call the new cleanup function in
> >>> ReplicationSlotRelease()? That's called during process exit and we know
> >>> there about slot so no extra global variables are needed.
> >>>
> >>
> >> I guess that ReplicationSlotRelease() currently might not get called
> >> if walsender exits by proc_exit(). ReplicationSlotRelease() can is
> >> called by some functions such as WalSndErrorCleanup(), but at least in
> >> the case where wal sender exits due to failed to write data to socket,
> >> ReplicationSlotRelease() didn't get called as far as I tested.
> >>
> >
> > Are you sure about that testing? Did you test it with replication slot
> > active? ReplicationSlotRelease() is called from ProcKill() if the
> > process is using a slot and should be called for any kind of exit except
> > for outright crash (the kind that makes postgres kill all backends). If
> > it wasn't we'd never unlock the replication slot used by the exiting
> > walsender.
>
> Ah, sorry I was wrong. WalSndErrorCleanup() didn't get called but
> ReplicationSlotRelease() got called. I agree that cleanup function
> gets called in ReplicationSlotRelease().
should actually be in Needs Review (or perhaps Ready for Commmitter)..?
Craig, would you agree with that and, if so, please update the status
accordingly.
WoA looks correct, Petr suggested a tweak to how and when the cleanup is done that I need to adopt.
I'm just about out of time before a trip, but I'll get a colleague to update it if I don't get the chance, so we can get it in and backpatched for this CF.
I think this should use ReadDirExtended with an elevel less than ERROR, and do nothing. Why have strcmp(.) and strcmp(..)? These are going to be skipped by the comparison to "xid" prefix anyway. Looks like strcmp processing power waste. Please don't use bare sprintf() -- upgrade to snprintf. With this coding, if I put a root-owned file "xidfoo" in a replslot directory, it will PANIC the server. Is that okay? Why not read the file name with sscanf(), since we know the precise format it has? Then we don't need to bother with random crap left around. Maybe a good time to put the "xid-%u-lsn-%X-%X.snap" literal in a macro? I guess the rationale is that if you let random people put "xidfoo" files in your replication slot dirs, you deserve a PANIC anyway, but I'm not sure. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 6 January 2018 at 08:28, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I think this should use ReadDirExtended with an elevel less than ERROR,
and do nothing.
Why have strcmp(.) and strcmp(..)? These are going to be skipped by the
comparison to "xid" prefix anyway. Looks like strcmp processing power waste.
Please don't use bare sprintf() -- upgrade to snprintf.
With this coding, if I put a root-owned file "xidfoo" in a replslot
directory, it will PANIC the server. Is that okay? Why not read the
file name with sscanf(), since we know the precise format it has? Then
we don't need to bother with random crap left around. Maybe a good time
to put the "xid-%u-lsn-%X-%X.snap" literal in a macro? I guess the
rationale is that if you let random people put "xidfoo" files in your
replication slot dirs, you deserve a PANIC anyway, but I'm not sure.
I'm happy to address those comments.
The PANIC probably made sense when it was only done on startup, but not now it's at runtime.
The rest is mainly retained from the prior code, but it's a good chance to make those changes.
Hi Craig, On 1/21/18 5:45 PM, Craig Ringer wrote: > On 6 January 2018 at 08:28, Alvaro Herrera <alvherre@alvh.no-ip.org > <mailto:alvherre@alvh.no-ip.org>> wrote: > > I think this should use ReadDirExtended with an elevel less than ERROR, > and do nothing. > > Why have strcmp(.) and strcmp(..)? These are going to be skipped by the > comparison to "xid" prefix anyway. Looks like strcmp processing > power waste. > > Please don't use bare sprintf() -- upgrade to snprintf. > > With this coding, if I put a root-owned file "xidfoo" in a replslot > directory, it will PANIC the server. Is that okay? Why not read the > file name with sscanf(), since we know the precise format it has? Then > we don't need to bother with random crap left around. Maybe a good time > to put the "xid-%u-lsn-%X-%X.snap" literal in a macro? I guess the > rationale is that if you let random people put "xidfoo" files in your > replication slot dirs, you deserve a PANIC anyway, but I'm not sure. > > I'm happy to address those comments. > > The PANIC probably made sense when it was only done on startup, but not > now it's at runtime. > > The rest is mainly retained from the prior code, but it's a good chance > to make those changes. This patch was marked Waiting on Author last December. Do you know when you'll have a chance to provide an updated patch? Given that it's a bug fix it would be good to see a patch for this CF, or very soon after. Thanks, -- -David david@pgmasters.net
On 5 March 2018 at 23:25, David Steele <david@pgmasters.net> wrote:
Hi Craig,
On 1/21/18 5:45 PM, Craig Ringer wrote:
> On 6 January 2018 at 08:28, Alvaro Herrera <alvherre@alvh.no-ip.org
> <mailto:alvherre@alvh.no-ip.org>> wrote:
>
> I think this should use ReadDirExtended with an elevel less than ERROR,
> and do nothing.
>
> Why have strcmp(.) and strcmp(..)? These are going to be skipped by the
> comparison to "xid" prefix anyway. Looks like strcmp processing
> power waste.
>
> Please don't use bare sprintf() -- upgrade to snprintf.
>
> With this coding, if I put a root-owned file "xidfoo" in a replslot
> directory, it will PANIC the server. Is that okay? Why not read the
> file name with sscanf(), since we know the precise format it has? Then
> we don't need to bother with random crap left around. Maybe a good time
> to put the "xid-%u-lsn-%X-%X.snap" literal in a macro? I guess the
> rationale is that if you let random people put "xidfoo" files in your
> replication slot dirs, you deserve a PANIC anyway, but I'm not sure.
>
> I'm happy to address those comments.
>
> The PANIC probably made sense when it was only done on startup, but not
> now it's at runtime.
>
> The rest is mainly retained from the prior code, but it's a good chance
> to make those changes.
This patch was marked Waiting on Author last December. Do you know when
you'll have a chance to provide an updated patch?
Given that it's a bug fix it would be good to see a patch for this CF,
or very soon after.
Thanks for the reminder. I'll address it today if I can.
On 6 March 2018 at 09:58, Craig Ringer <craig@2ndquadrant.com> wrote:
On 5 March 2018 at 23:25, David Steele <david@pgmasters.net> wrote:Hi Craig,
On 1/21/18 5:45 PM, Craig Ringer wrote:
> On 6 January 2018 at 08:28, Alvaro Herrera <alvherre@alvh.no-ip.org
> <mailto:alvherre@alvh.no-ip.org>> wrote:
>
> I think this should use ReadDirExtended with an elevel less than ERROR,
> and do nothing.
>
> Why have strcmp(.) and strcmp(..)? These are going to be skipped by the
> comparison to "xid" prefix anyway. Looks like strcmp processing
> power waste.
>
> Please don't use bare sprintf() -- upgrade to snprintf.
>
> With this coding, if I put a root-owned file "xidfoo" in a replslot
> directory, it will PANIC the server. Is that okay? Why not read the
> file name with sscanf(), since we know the precise format it has? Then
> we don't need to bother with random crap left around. Maybe a good time
> to put the "xid-%u-lsn-%X-%X.snap" literal in a macro? I guess the
> rationale is that if you let random people put "xidfoo" files in your
> replication slot dirs, you deserve a PANIC anyway, but I'm not sure.
>
> I'm happy to address those comments.
>
> The PANIC probably made sense when it was only done on startup, but not
> now it's at runtime.
>
> The rest is mainly retained from the prior code, but it's a good chance
> to make those changes.
This patch was marked Waiting on Author last December. Do you know when
you'll have a chance to provide an updated patch?
Given that it's a bug fix it would be good to see a patch for this CF,
or very soon after.Thanks for the reminder. I'll address it today if I can.
Revised patch attached.
I have _not_ rewritten to use sscanf yet. I'll do that next, so you can choose the fewer-changes patch for backpatching if desired.
Вложения
On 6 March 2018 at 16:07, Craig Ringer <craig@2ndquadrant.com> wrote:
On 6 March 2018 at 09:58, Craig Ringer <craig@2ndquadrant.com> wrote:On 5 March 2018 at 23:25, David Steele <david@pgmasters.net> wrote:Hi Craig,
On 1/21/18 5:45 PM, Craig Ringer wrote:
> On 6 January 2018 at 08:28, Alvaro Herrera <alvherre@alvh.no-ip.org
> <mailto:alvherre@alvh.no-ip.org>> wrote:
>
> I think this should use ReadDirExtended with an elevel less than ERROR,
> and do nothing.
>
> Why have strcmp(.) and strcmp(..)? These are going to be skipped by the
> comparison to "xid" prefix anyway. Looks like strcmp processing
> power waste.
>
> Please don't use bare sprintf() -- upgrade to snprintf.
>
> With this coding, if I put a root-owned file "xidfoo" in a replslot
> directory, it will PANIC the server. Is that okay? Why not read the
> file name with sscanf(), since we know the precise format it has? Then
> we don't need to bother with random crap left around. Maybe a good time
> to put the "xid-%u-lsn-%X-%X.snap" literal in a macro? I guess the
> rationale is that if you let random people put "xidfoo" files in your
> replication slot dirs, you deserve a PANIC anyway, but I'm not sure.
>
> I'm happy to address those comments.
>
> The PANIC probably made sense when it was only done on startup, but not
> now it's at runtime.
>
> The rest is mainly retained from the prior code, but it's a good chance
> to make those changes.
This patch was marked Waiting on Author last December. Do you know when
you'll have a chance to provide an updated patch?
Given that it's a bug fix it would be good to see a patch for this CF,
or very soon after.Thanks for the reminder. I'll address it today if I can.Revised patch attached.I have _not_ rewritten to use sscanf yet. I'll do that next, so you can choose the fewer-changes patch for backpatching if desired.
... and I'm not convinced it's really an improvement.
uint32 xid, lsn_high, lsn_low;
if (sscanf(spill_de->d_name, REORDERBUFFER_TXN_FILENAME_FORMAT,
xid, lsn_high, lsn_low) == 3)
{
since we don't use the scanned-out information.
I guess my answer to causing problems if you create a file named "xidfoo" in a slot directory is yeah, don't do that.
Note that this patch changes the PANIC to ERROR. This will promote to FATAL during the startup process, which I think is fine. Objections?
On 06/03/18 09:37, Craig Ringer wrote: > > Revised patch attached. > > I have _not_ rewritten to use sscanf yet. I'll do that next, so you > can choose the fewer-changes patch for backpatching if desired. > > > ... and I'm not convinced it's really an improvement. > > uint32 xid, lsn_high, lsn_low; > > if (sscanf(spill_de->d_name, REORDERBUFFER_TXN_FILENAME_FORMAT, > xid, lsn_high, lsn_low) == 3) > { > > since we don't use the scanned-out information. > > I guess my answer to causing problems if you create a file named > "xidfoo" in a slot directory is yeah, don't do that. > > Note that this patch changes the PANIC to ERROR. This will promote to > FATAL during the startup process, which I think is fine. Objections? > You mean because PG_exception_stack is not initialized for startup process? That deserves comment I think. Other than that if I am very nitpicky, I'd call the new function ReorderBufferCleanupSerializedTXNs. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer wrote: > Revised patch attached. > > I have _not_ rewritten to use sscanf yet. I'll do that next, so you can > choose the fewer-changes patch for backpatching if desired. Pushed, with a further change: it seems more sensible to centralize the whole operation of building the path, rather than just the format string, so I created a new function to do that. The code looks cleaner IMO this way IMO. All tests pass in all branches. BTW the way the XLogSegNoOffsetToRecPtr() et al macros were modified to accept wal_segment_size at the end of the argument list, *after* the output argument, seems pretty bad style. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services