Discussion: standalone backend PANICs during recovery


standalone backend PANICs during recovery

From: Bernd Helmle
Date:
While investigating a problem on a streaming hot standby instance at a
customer site, I got the following when using a standalone backend:

PANIC:  btree_xlog_delete_get_latestRemovedXid: cannot operate with
inconsistent data
CONTEXT:  xlog redo delete: index 1663/65588/65625; iblk 11, heap
1663/65588/65613;

The responsible PANIC is triggered here:

src/backend/access/nbtree/nbtxlog.c:555

btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record)
{

[...]
    if (!reachedConsistency)
        elog(PANIC, "btree_xlog_delete_get_latestRemovedXid: cannot operate with
inconsistent data");
[...]

}

There's already an "optimization" just before that, which exits with
InvalidTransactionId when the hot standby doesn't have any active backends.
In standalone mode, however, CountDBBackends() will always return 1, AFAIK.
It looks like btree_xlog_delete_get_latestRemovedXid() needs an additional
check for standalone backends, so I came up with the attached patch. This
allowed the standalone backend to recover without any problems.
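
For illustration, a rough sketch of the kind of check described above (the
actual attached patch is not reproduced here and may differ):

    /*
     * Hypothetical early exit near the top of
     * btree_xlog_delete_get_latestRemovedXid(): a standalone backend has no
     * hot-standby sessions whose snapshots could be invalidated, so the
     * whole computation can be skipped.
     */
    if (!IsUnderPostmaster)
        return InvalidTransactionId;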

--
Thanks

    Bernd
Attachments

Re: standalone backend PANICs during recovery

From: Alvaro Herrera
Date:
Bernd Helmle wrote:
> While investigating a problem on a streaming hot standby instance at a
> customer site, i got the following when using a standalone backend:
> 
> PANIC:  btree_xlog_delete_get_latestRemovedXid: cannot operate with
> inconsistent data
> CONTEXT:  xlog redo delete: index 1663/65588/65625; iblk 11, heap
> 1663/65588/65613;
> 
> The responsible PANIC is triggered here:
> 
> src/backend/access/nbtree/nbtxlog.c:555
> 
> btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record)

This PANIC was introduced in the commit below.  AFAICS your proposed
patch and rationale are correct.


commit f786e91a75b2f64527dcf321e754b6448fcad7fe
Author:     Tom Lane <tgl@sss.pgh.pa.us>
AuthorDate: Fri Aug 3 15:41:18 2012 -0400
CommitDate: Fri Aug 3 15:41:18 2012 -0400
    Improve underdocumented btree_xlog_delete_get_latestRemovedXid() code.

    As noted by Noah Misch, btree_xlog_delete_get_latestRemovedXid is
    critically dependent on the assumption that it's examining a consistent
    state of the database.  This was undocumented though, so the
    seemingly-unrelated check for no active HS sessions might be thought to
    be merely an optional optimization.  Improve comments, and add an
    explicit check of reachedConsistency just to be sure.

    This function returns InvalidTransactionId (thereby killing all HS
    transactions) in several cases that are not nearly unlikely enough for
    my taste.  This commit doesn't attempt to fix those deficiencies, just
    document them.

    Back-patch to 9.2, not from any real functional need but just to keep
    the branches more closely synced to simplify possible future
    back-patching.



-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: standalone backend PANICs during recovery

From: Robert Haas
Date:
On Sat, Apr 2, 2016 at 5:57 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Bernd Helmle wrote:
>> While investigating a problem on a streaming hot standby instance at a
>> customer site, i got the following when using a standalone backend:
>>
>> PANIC:  btree_xlog_delete_get_latestRemovedXid: cannot operate with
>> inconsistent data
>> CONTEXT:  xlog redo delete: index 1663/65588/65625; iblk 11, heap
>> 1663/65588/65613;
>>
>> The responsible PANIC is triggered here:
>>
>> src/backend/access/nbtree/nbtxlog.c:555
>>
>> btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record)
>
> This PANIC was introduced in the commit below.  AFAICS your proposed
> patch and rationale are correct.

If nobody's going to commit this right away, this should be added to
the next CommitFest so we don't forget it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: standalone backend PANICs during recovery

From: Bernd Helmle
Date:
--On 5. April 2016 21:50:17 -0400 Robert Haas <robertmhaas@gmail.com> wrote:

> If nobody's going to commit this right away, this should be added to
> the next CommitFest so we don't forget it.

Done.

-- 
Thanks
Bernd



Re: standalone backend PANICs during recovery

From: Tom Lane
Date:
[ got around to looking at this finally ]

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Bernd Helmle wrote:
>> While investigating a problem on a streaming hot standby instance at a
>> customer site, i got the following when using a standalone backend:
>> 
>> PANIC:  btree_xlog_delete_get_latestRemovedXid: cannot operate with
>> inconsistent data
>> CONTEXT:  xlog redo delete: index 1663/65588/65625; iblk 11, heap
>> 1663/65588/65613;
>> 
>> The responsible PANIC is triggered here:
>> src/backend/access/nbtree/nbtxlog.c:555

> This PANIC was introduced in the commit below.  AFAICS your proposed
> patch and rationale are correct.

I'm not very convinced by this.  What seems to me to be the core issue is
how it can make sense for InHotStandby to read as TRUE in a standalone
backend.  The best-case situation if that's true is that the standalone
backend will do a lot of unnecessary work on the baseless assumption that
there are other backends it has to keep things valid for.  The worst-case
situation is that you trip over some bug and are unable to complete
recovery, even though you would have if the hot-standby support hadn't
been activated.

In short, I don't think control should have been here at all.  My proposal
for a fix is to force EnableHotStandby to remain false in a standalone
backend.
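
A minimal sketch of that idea, assuming the override sits wherever the
startup code finalizes its hot-standby settings (this is not an actual
patch):

    /*
     * Hot standby relies on cooperating processes (WAL receiver, walsenders,
     * backends holding snapshots); none of these exist in --single mode, so
     * keep the feature disabled regardless of what postgresql.conf says.
     */
    if (!IsUnderPostmaster)
        EnableHotStandby = false;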
        regards, tom lane



Re: standalone backend PANICs during recovery

From: Tom Lane
Date:
I wrote:
> In short, I don't think control should have been here at all.  My proposal
> for a fix is to force EnableHotStandby to remain false in a standalone
> backend.

I tried to reproduce Bernd's problem by starting a standalone backend in
a data directory that was configured as a hot standby slave, and soon
found that there are much bigger issues than this.  The startup sequence
soon tries to wait for WAL to arrive, which in HEAD uses
            WaitLatch(&XLogCtl->recoveryWakeupLatch,
                      WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                      5000L);

which immediately elog(FATAL)s because a standalone backend has no parent
postmaster and so postmaster_alive_fds[] isn't set.  But if it didn't do
that, it'd wait forever because of course there is no active WAL receiver
process that would ever provide more WAL.

The only way that you'd ever get to a command prompt is if somebody made a
promotion trigger file, which would cause the startup code to promote the
cluster into master status, which does not really seem like something that
would be a good idea in Bernd's proposed use case of "investigating a
problem".

Alternatively, if we were to force standby_mode off in a standalone
backend, it would come to the command prompt right away but again it would
have effectively promoted the cluster to master.  That is certainly not
something we should ever do automatically.

So at this point I'm pretty baffled as to what the actual use-case is
here.  I am tempted to say that a standalone backend should refuse to
start at all if a recovery.conf file is present.  If we do want to
allow the case, we need some careful thought about what it should do.
        regards, tom lane



Re: standalone backend PANICs during recovery

From: Michael Paquier
Date:
On Sun, Aug 21, 2016 at 1:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So at this point I'm pretty baffled as to what the actual use-case is
> here.

Perhaps it is easier to attach a debugger in this case to analyze the problem?

> I am tempted to say that a standalone backend should refuse to
> start at all if a recovery.conf file is present.  If we do want to
> allow the case, we need some careful thought about what it should do.

+1 for preventing an instance in --single mode from starting if
recovery.conf is present. It is better to put safeguards in place than to
get unwanted behavior.
-- 
Michael



Re: standalone backend PANICs during recovery

From: Bernd Helmle
Date:

--On 20. August 2016 12:41:48 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> So at this point I'm pretty baffled as to what the actual use-case is
> here.  I am tempted to say that a standalone backend should refuse to
> start at all if a recovery.conf file is present.  If we do want to
> allow the case, we need some careful thought about what it should do.

Well, the "use case" was a crashed hot standby at a customer site (it kept
PANICing after restarting), where i decided to have a look through the
recovery process with gdb. The exact PANIC was

2016-03-29 15:12:40 CEST [16097]: [26-1] user=,db=,xid=0,vxid=1/0 FATAL:
unexpected GIN leaf action: 0

I had the idea that using a single backend would be a quick and dirty way to
do that. I was surprised that this time it PANICked differently...

That said, I'm okay if --single is not intended to bring up a hot standby.
There are many other ways to debug such problems.

-- 
Thanks
Bernd



Re: standalone backend PANICs during recovery

From: Michael Paquier
Date:
On Wed, Aug 24, 2016 at 5:07 PM, Bernd Helmle <mailings@oopsware.de> wrote:
> That said, i'm okay if --single is not intended to bring up a hot standby.
> There are many other ways to debug such problems.

This patch is still on the CF app:
https://commitfest.postgresql.org/10/610/
Even after thinking about it, my opinion is still the same. Let's
prevent a standalone backend from starting if it sees recovery.conf.
Attached is a patch, and the CF entry has been switched to ready for
committer so we can move on.
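
The attached patch itself is not reproduced here; judging from the
discussion that follows, its approach (checking for recovery.conf early in
StartupXLOG, before touching anything in PGDATA) might look roughly like
this sketch, with the error wording invented for illustration:

    struct stat stat_buf;

    /*
     * Refuse to start a standalone backend if recovery.conf is present;
     * RECOVERY_COMMAND_FILE is the existing "recovery.conf" macro in xlog.c.
     */
    if (!IsUnderPostmaster &&
        stat(RECOVERY_COMMAND_FILE, &stat_buf) == 0)
        ereport(FATAL,
                (errmsg("recovery is not supported in single-user mode")));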
--
Michael

Attachments

Re: standalone backend PANICs during recovery

From: Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Wed, Aug 24, 2016 at 5:07 PM, Bernd Helmle <mailings@oopsware.de> wrote:
>> That said, i'm okay if --single is not intended to bring up a hot standby.
>> There are many other ways to debug such problems.

> This patch is still on the CF app:
> https://commitfest.postgresql.org/10/610/
> Even after thinking about it, my opinion is still the same. Let's
> prevent a standalone backend to start if it sees recovery.conf.
> Attached is a patch and the CF entry is switched as ready for
> committer to move on.

Hm, StartupXLOG seems like a pretty random place to check that, especially
since doing it there requires an extra stat() call.  Why didn't you just
make readRecoveryCommandFile() error out?
        regards, tom lane



Re: standalone backend PANICs during recovery

From: Michael Paquier
Date:
On Tue, Aug 30, 2016 at 9:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Wed, Aug 24, 2016 at 5:07 PM, Bernd Helmle <mailings@oopsware.de> wrote:
>>> That said, i'm okay if --single is not intended to bring up a hot standby.
>>> There are many other ways to debug such problems.
>
>> This patch is still on the CF app:
>> https://commitfest.postgresql.org/10/610/
>> Even after thinking about it, my opinion is still the same. Let's
>> prevent a standalone backend to start if it sees recovery.conf.
>> Attached is a patch and the CF entry is switched as ready for
>> committer to move on.
>
> Hm, StartupXLOG seems like a pretty random place to check that, especially
> since doing it there requires an extra stat() call.  Why didn't you just
> make readRecoveryCommandFile() error out?

Well, the idea is to do the check before doing anything to PGDATA, leaving
it intact; in particular, before the post-crash fsync().
-- 
Michael



Re: standalone backend PANICs during recovery

From: Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Tue, Aug 30, 2016 at 9:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hm, StartupXLOG seems like a pretty random place to check that, especially
>> since doing it there requires an extra stat() call.  Why didn't you just
>> make readRecoveryCommandFile() error out?

> Well, the idea is to do the check before doing anything on PGDATA and
> leave it intact, particularly the post-crash fsync().

I don't see anything very exciting between the beginning of StartupXLOG
and readRecoveryCommandFile.  In particular, doing the fsync seems like
a perfectly harmless and maybe-good thing.  If there were some operation
with potentially bad side-effects in that range, it would be dangerous
anyway because of the risk of readRecoveryCommandFile erroring out due
to invalid contents of recovery.conf.

This might be an argument for re-ordering what we're doing in StartupXLOG,
but that seems like an independent discussion.
        regards, tom lane



Re: standalone backend PANICs during recovery

From: Michael Paquier
Date:
On Tue, Aug 30, 2016 at 10:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Tue, Aug 30, 2016 at 9:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Hm, StartupXLOG seems like a pretty random place to check that, especially
>>> since doing it there requires an extra stat() call.  Why didn't you just
>>> make readRecoveryCommandFile() error out?
>
>> Well, the idea is to do the check before doing anything on PGDATA and
>> leave it intact, particularly the post-crash fsync().
>
> I don't see anything very exciting between the beginning of StartupXLOG
> and readRecoveryCommandFile.  In particular, doing the fsync seems like
> a perfectly harmless and maybe-good thing.  If there were some operation
> with potentially bad side-effects in that range, it would be dangerous
> anyway because of the risk of readRecoveryCommandFile erroring out due
> to invalid contents of recovery.conf.

Does the attached suit better then?
--
Michael

Attachments

Re: standalone backend PANICs during recovery

From: Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> Does the attached suit better then?

Thinking about it some more ... what we actually need to prevent, AFAICS,
is standby_mode becoming true in a standalone backend.  If you don't set
that, then the process will do PITR recovery, but I'm not aware that there
is any problem with running that standalone, and maybe it's useful to
allow it for debugging purposes.  So I'm thinking more like
        /*
         * Check for recovery control file, and if so set up state for offline
         * recovery
         */
        readRecoveryCommandFile();

+    /*
+     * We don't support standby_mode in standalone backends; that
+     * requires other processes such as WAL receivers to be alive.
+     */
+    if (StandbyModeRequested && !IsUnderPostmaster)
+        ereport(FATAL, ...);
+
        /*
         * Save archive_cleanup_command in shared memory so that other processes
         * can see it.
         */

or we could put the check near the bottom of readRecoveryCommandFile.
Not sure which placement is clearer.
        regards, tom lane



Re: standalone backend PANICs during recovery

From: Michael Paquier
Date:
On Tue, Aug 30, 2016 at 11:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Does the attached suit better then?
>
> Thinking about it some more ... what we actually need to prevent, AFAICS,
> is standby_mode becoming true in a standalone backend.  If you don't set
> that, then the process will do PITR recovery, but I'm not aware that there
> is any problem with running that standalone, and maybe it's useful to
> allow it for debugging purposes.  So I'm thinking more like
>
>         /*
>          * Check for recovery control file, and if so set up state for offline
>          * recovery
>          */
>         readRecoveryCommandFile();
>
> +       /*
> +        * We don't support standby_mode in standalone backends; that
> +        * requires other processes such as WAL receivers to be alive.
> +        */
> +       if (StandbyModeRequested && !IsUnderPostmaster)
> +               ereport(FATAL, ...);
> +
>         /*
>          * Save archive_cleanup_command in shared memory so that other processes
>          * can see it.
>          */
>
> or we could put the check near the bottom of readRecoveryCommandFile.
> Not sure which placement is clearer.

I have spent some time playing with that and you are right. Only
standby_mode = on is able to trigger a failure here, and the first one is
in WaitForWALToBecomeAvailable(). I'd just put the check at the end of
readRecoveryCommandFile(); that avoids extra thinking should
readRecoveryCommandFile() ever be moved around. That's unlikely to happen,
but it is cheap insurance.

I have added a test for that using 001_stream_rep.pl. It's not actually a
good idea to push this test, because if the startup were allowed to
proceed, the postgres process would just hang and prevent the test from
continuing. But it makes it easier for anybody looking at this patch to
test the pattern prevented here.
--
Michael

Attachments

Re: standalone backend PANICs during recovery

From: Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Tue, Aug 30, 2016 at 11:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thinking about it some more ... what we actually need to prevent, AFAICS,
>> is standby_mode becoming true in a standalone backend.

> I have spent some time playing with that and you are right. Only
> standby_mode = on is able trigger a failure here, and the first one is
> in WaitForWALToBecomeAvailable(). I'd just put the check at the end of
> readRecoveryCommandFile(), this will avoid extra thinking should
> readRecoveryCommandFile() be moved around. That's unlikely to happen
> but it is a cheap insurance.

Pushed with minor adjustments --- I moved the check to be closer to
existing sanity checks in readRecoveryCommandFile, tweaked the wording
(the man page for postgres refers to --single as "single-user server"
mode) and added an errcode.
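
Based on that description, the committed check in readRecoveryCommandFile()
should look roughly like the following (reconstructed here, so details may
differ from the actual commit):

    if (StandbyModeRequested && !IsUnderPostmaster)
        ereport(FATAL,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("standby mode is not supported by single-user servers")));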
        regards, tom lane



Re: standalone backend PANICs during recovery

From: Tom Lane
Date:
Bernd Helmle <mailings@oopsware.de> writes:
> Well, the "use case" was a crashed hot standby at a customer site (it kept
> PANICing after restarting), where i decided to have a look through the
> recovery process with gdb. The exact PANIC was

> 2016-03-29 15:12:40 CEST [16097]: [26-1] user=,db=,xid=0,vxid=1/0 FATAL:
> unexpected GIN leaf action: 0

BTW, that didn't happen to be on big-endian hardware, did it?
        regards, tom lane



Re: standalone backend PANICs during recovery

From: Bernd Helmle
Date:

--On 3. September 2016 um 02:05:00 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> Well, the "use case" was a crashed hot standby at a customer site (it
>> kept PANICing after restarting), where i decided to have a look through
>> the recovery process with gdb. The exact PANIC was
> 
>> 2016-03-29 15:12:40 CEST [16097]: [26-1] user=,db=,xid=0,vxid=1/0 FATAL:
>> unexpected GIN leaf action: 0
> 
> BTW, that didn't happen to be on big-endian hardware did it?

You're right, this was RHEL7 on a POWER7 machine.

-- 
Thanks
Bernd