Discussion: Something is rotten in the state of Denmark...

Something is rotten in the state of Denmark...

From
Tom Lane
Date:
Observe these recent buildfarm failures:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2015-03-21%2000%3A30%3A02
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=guaibasaurus&dt=2015-03-23%2004%3A17%3A01
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2015-03-31%2023%3A30%3A02

Three similar-looking failures, on two different machines, in a regression
test that has existed for less than three weeks.  Something is very wrong.

What's worse, the failure does not look like something that could have
been directly provoked by commit 31eae6028eca4365e7165f5f33fee1ed0486aee0;
I'm afraid it's just falling victim to a bug that may have been there for
a while.  Given the evidently low probability of triggering the bug, it
may be hard to isolate :-(
        regards, tom lane



Re: Something is rotten in the state of Denmark...

From
Tom Lane
Date:
I wrote:
> Observe these recent buildfarm failures:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2015-03-21%2000%3A30%3A02
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=guaibasaurus&dt=2015-03-23%2004%3A17%3A01
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2015-03-31%2023%3A30%3A02

> Three similar-looking failures, on two different machines, in a regression
> test that has existed for less than three weeks.  Something is very wrong.

I've been able to reproduce this.  The triggering event seems to be that
the "VACUUM FULL pg_am" in vacuum.sql has to happen while another backend
is starting up.  With a ten-second delay inserted at the bottom of
PerformAuthentication(), it's trivial to hit it manually.  The reason we'd
not seen this before the rolenames.sql test was added is that none of the
other tests that run concurrently with vacuum.sql perform mid-test
reconnections, or ever have AFAIR.  So as long as they all managed to
start up before vacuum.sql got to the dangerous step, no problem.
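
(For reference, a minimal sketch of that kind of delay, assuming it goes at the very bottom of PerformAuthentication() in src/backend/utils/init/postinit.c; this is purely a local debugging hack, not part of any patch:)

	/*
	 * Debugging hack only: widen the window in which a concurrent
	 * "VACUUM FULL pg_am" can overlap this backend's startup.
	 */
	pg_usleep(10L * 1000L * 1000L);		/* ten seconds, in microseconds */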

I've not fully tracked it down, but I think that the blame falls on the
MVCC-snapshots-for-catalog-scans patch; it appears that it's trying to
read pg_am's pg_class entry with a snapshot that's too old, possibly
because it assumes that sinval signaling is alive which I think ain't so.

For even more fun, try "VACUUM FULL pg_class" instead:

psql: PANIC:  could not open critical system index 2662
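
(That PANIC is issued while loading the relcache's critical indexes; paraphrasing rather than quoting relcache.c, the relevant step looks roughly like the sketch below, with 2662 being pg_class_oid_index:)

	/*
	 * Sketch of load_critical_index(): if the index's catalog entries can't
	 * be read at this point -- e.g. because pg_class was just rewritten by
	 * VACUUM FULL and we're looking with stale state -- there is nothing
	 * sensible left to do but PANIC.
	 */
	ird = RelationBuildDesc(indexoid, true);
	if (ird == NULL)
		elog(PANIC, "could not open critical system index %u", indexoid);
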
        regards, tom lane



Re: Something is rotten in the state of Denmark...

From
Robert Haas
Date:
On Wed, Apr 1, 2015 at 7:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've been able to reproduce this.  The triggering event seems to be that
> the "VACUUM FULL pg_am" in vacuum.sql has to happen while another backend
> is starting up.  With a ten-second delay inserted at the bottom of
> PerformAuthentication(), it's trivial to hit it manually.  The reason we'd
> not seen this before the rolenames.sql test was added is that none of the
> other tests that run concurrently with vacuum.sql perform mid-test
> reconnections, or ever have AFAIR.  So as long as they all managed to
> start up before vacuum.sql got to the dangerous step, no problem.
>
> I've not fully tracked it down, but I think that the blame falls on the
> MVCC-snapshots-for-catalog-scans patch; it appears that it's trying to
> read pg_am's pg_class entry with a snapshot that's too old, possibly
> because it assumes that sinval signaling is alive which I think ain't so.

Hmm, sorry for the bug.  It looks to me like sinval signaling gets
started up at the beginning of InitPostgres().  Perhaps something like
this?

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7cfa0cf..47c4002 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -303,7 +303,7 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 		 * Mark new snapshost as valid.  We must do this last, in case an
 		 * ERROR occurs inside GetSnapshotData().
 		 */
-		CatalogSnapshotStale = false;
+		CatalogSnapshotStale = !IsNormalProcessingMode();
 	}
 
 	return CatalogSnapshot;

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something is rotten in the state of Denmark...

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Apr 1, 2015 at 7:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I've not fully tracked it down, but I think that the blame falls on the
>> MVCC-snapshots-for-catalog-scans patch; it appears that it's trying to
>> read pg_am's pg_class entry with a snapshot that's too old, possibly
>> because it assumes that sinval signaling is alive which I think ain't so.

> Hmm, sorry for the bug.  It looks to me like sinval signaling gets
> started up at the beginning of InitPostgres().  Perhaps something like
> this?

Yeah, it does, so you'd think this would work.  I've now tracked down
exactly what's happening, and it's this: while we're reading pg_authid
during PerformAuthentication, we establish a catalog snapshot.  Later on,
we read pg_database to find out what DB we're connecting to.  If any
sinval messages for unshared system catalogs arrive at that point, they'll
effectively be ignored, *because our MyDatabaseId is still zero and so
doesn't match the incoming message*.  That's okay as far as the catcaches
are concerned, cause they're empty, and the relcache only knows about
shared catalogs so far.  But InvalidateCatalogSnapshot doesn't get called
by LocalExecuteInvalidationMessage, and so we think we can plow ahead with
the old snapshot.
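
(Spelled out, the filtering in question is roughly the following -- a paraphrase of the relcache-message branch in inval.c, not a verbatim excerpt:)

	/*
	 * During startup MyDatabaseId is still InvalidOid, so a message stamped
	 * with the target database's OID fails this test and is simply dropped;
	 * in particular, nothing marks the already-taken catalog snapshot stale.
	 */
	if (msg->rc.dbId == MyDatabaseId || msg->rc.dbId == InvalidOid)
		RelationCacheInvalidateEntry(msg->rc.relId);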

It looks to me like an appropriate fix would be as attached; thoughts?
(I've tested this with a delay during relation lock acquisition and
concurrent whole-database VACUUM FULLs, and so far been unable to break
it.)

            regards, tom lane

diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1e646a1..6aa6868 100644
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
*************** InitPostgres(const char *in_dbname, Oid
*** 853,858 ****
--- 853,866 ----
      MyProc->databaseId = MyDatabaseId;

      /*
+      * We may have already established a catalog snapshot while trying to read
+      * pg_authid; but until we have set up MyDatabaseId, we won't react to
+      * incoming sinval messages properly, so we won't realize it if the
+      * snapshot has been invalidated.  Best to assume it's no good anymore.
+      */
+     InvalidateCatalogSnapshot();
+
+     /*
       * Now, take a writer's lock on the database we are trying to connect to.
       * If there is a concurrently running DROP DATABASE on that database, this
       * will block us until it finishes (and has committed its update of

Re: Something is rotten in the state of Denmark...

From
Robert Haas
Date:
On Thu, Apr 2, 2015 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Apr 1, 2015 at 7:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I've not fully tracked it down, but I think that the blame falls on the
>>> MVCC-snapshots-for-catalog-scans patch; it appears that it's trying to
>>> read pg_am's pg_class entry with a snapshot that's too old, possibly
>>> because it assumes that sinval signaling is alive which I think ain't so.
>
>> Hmm, sorry for the bug.  It looks to me like sinval signaling gets
>> started up at the beginning of InitPostgres().  Perhaps something like
>> this?
>
> Yeah, it does, so you'd think this would work.  I've now tracked down
> exactly what's happening, and it's this: while we're reading pg_authid
> during PerformAuthentication, we establish a catalog snapshot.  Later on,
> we read pg_database to find out what DB we're connecting to.  If any
> sinval messages for unshared system catalogs arrive at that point, they'll
> effectively be ignored, *because our MyDatabaseId is still zero and so
> doesn't match the incoming message*.  That's okay as far as the catcaches
> are concerned, cause they're empty, and the relcache only knows about
> shared catalogs so far.  But InvalidateCatalogSnapshot doesn't get called
> by LocalExecuteInvalidationMessage, and so we think we can plow ahead with
> the old snapshot.
>
> It looks to me like an appropriate fix would be as attached; thoughts?
> (I've tested this with a delay during relation lock acquisition and
> concurrent whole-database VACUUM FULLs, and so far been unable to break
> it.)

Hmm, that fix doesn't reach as far as what I did.  My proposal would
regard a catalog snapshot as immediately stale, so if we're asked for
a catalog snapshot multiple times before InitPostgres() is called,
we'll take a new one every time.  Your proposal invalidates them just
once, after setting up MyDatabaseId.  Assuming yours is safe, it's
better, because it invalidates less aggressively.

The only thing I'm worried about is that I think
PerformAuthentication() runs before InitPostgres(); sinval isn't
working at all at that point, even for shared catalogs.  If we were to
lock a relation, build a relcache entry, unlock it, and then do that
again, all before SharedInvalBackendInit(false) is called, what would
keep us from failing to notice an intervening invalidation?  Maybe
there is something, because otherwise this was probably broken even
before the MVCC-catalog-snapshots patch, but I don't know what it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something is rotten in the state of Denmark...

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Apr 2, 2015 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It looks to me like an appropriate fix would be as attached; thoughts?

> Hmm, that fix doesn't reach as far as what I did.  My proposal would
> regard a catalog snapshot as immediately stale, so if we're asked for
> a catalog snapshot multiple times before InitPostgres() is called,
> we'll take a new one every time.  Your proposal invalidates them just
> once, after setting up MyDatabaseId.  Assuming yours is safe, it's
> better, because it invalidates less aggressively.

Right.

> The only thing I'm worried about is that I think
> PerformAuthentication() runs before InitPostgres(); sinval isn't
> working at all at that point, even for shared catalogs.

No, PerformAuthentication is called by InitPostgres.

However, I'm having second thoughts about whether we've fully diagnosed
this.  Three out of the four failures we've seen in the buildfarm reported
"cache lookup failed for access method 403", not "could not open relation
with OID 2601" ... and I'm so far only able to replicate the latter
symptom.  It's really unclear how the former one could arise, because
nothing that vacuum.sql does would change xmin of the rows in pg_am.
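
(For context, "cache lookup failed for access method 403" comes from a syscache probe of roughly this shape -- 403 being btree's row in pg_am -- so to reach it, that row has to be invisible to whatever snapshot the probe ends up using:)

	/* Sketch of the failing lookup; accessMethodId here would be 403. */
	tuple = SearchSysCache1(AMOID, ObjectIdGetDatum(accessMethodId));
	if (!HeapTupleIsValid(tuple))
		elog(ERROR, "cache lookup failed for access method %u",
			 accessMethodId);
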
        regards, tom lane



Re: Something is rotten in the state of Denmark...

From
Robert Haas
Date:
On Thu, Apr 2, 2015 at 2:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Apr 2, 2015 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> It looks to me like an appropriate fix would be as attached; thoughts?
>
>> Hmm, that fix doesn't reach as far as what I did.  My proposal would
>> regard a catalog snapshot as immediately stale, so if we're asked for
>> a catalog snapshot multiple times before InitPostgres() is called,
>> we'll take a new one every time.  Your proposal invalidates them just
>> once, after setting up MyDatabaseId.  Assuming yours is safe, it's
>> better, because it invalidates less aggressively.
>
> Right.
>
>> The only thing I'm worried about is that I think
>> PerformAuthentication() runs before InitPostgres(); sinval isn't
>> working at all at that point, even for shared catalogs.
>
> No, PerformAuthentication is called by InitPostgres.

Oops, OK.

> However, I'm having second thoughts about whether we've fully diagnosed
> this.  Three out of the four failures we've seen in the buildfarm reported
> "cache lookup failed for access method 403", not "could not open relation
> with OID 2601" ... and I'm so far only able to replicate the latter
> symptom.  It's really unclear how the former one could arise, because
> nothing that vacuum.sql does would change xmin of the rows in pg_am.

It probably changes the *relfilenode* of pg_am, because it runs VACUUM
FULL on that catalog.  Perhaps some backend sees the old relfilenode
value and tries to do a heap scan, interpreting the now-truncated file as
empty?
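
(Put differently: the physical file a scan opens is taken straight from the relfilenode in whatever pg_class row we managed to read -- roughly the logic below, paraphrased from relcache.c's RelationInitPhysicalAddr() -- so reading that row under a stale snapshot points us at the old, already-rewritten storage:)

	/*
	 * Mapped catalogs (relfilenode = 0) go through the relation mapper;
	 * everything else, pg_am included, trusts the pg_class value it read.
	 */
	if (relation->rd_rel->relfilenode != InvalidOid)
		relation->rd_node.relNode = relation->rd_rel->relfilenode;
	else
		relation->rd_node.relNode =
			RelationMapOidToFilenode(relation->rd_id,
									 relation->rd_rel->relisshared);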

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something is rotten in the state of Denmark...

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Apr 2, 2015 at 2:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, I'm having second thoughts about whether we've fully diagnosed
>> this.  Three out of the four failures we've seen in the buildfarm reported
>> "cache lookup failed for access method 403", not "could not open relation
>> with OID 2601" ... and I'm so far only able to replicate the latter
>> symptom.  It's really unclear how the former one could arise, because
>> nothing that vacuum.sql does would change xmin of the rows in pg_am.

> It probably changes the *relfilenode* of pg_am, because it runs VACUUM
> FULL on that catalog.  Perhaps some backend sees the old relfilenode
> value and tries to do a heap scan, interpreting the now-truncated file as
> empty?

Yeah, I came up with the same theory a few minutes later.  Trying to
reproduce on that basis.

Actually, now that I think it through, the "could not open relation"
error is pretty odd in itself.  If we are trying to open pg_am using
a stale catalog snapshot, it seems like we ought to reliably find its
old pg_class tuple (the one with the obsolete relfilenode), rather than
finding nothing.  But the latter is the behavior I'm seeing.
        regards, tom lane



Re: Something is rotten in the state of Denmark...

From
Robert Haas
Date:
On Thu, Apr 2, 2015 at 2:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Apr 2, 2015 at 2:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> However, I'm having second thoughts about whether we've fully diagnosed
>>> this.  Three out of the four failures we've seen in the buildfarm reported
>>> "cache lookup failed for access method 403", not "could not open relation
>>> with OID 2601" ... and I'm so far only able to replicate the latter
>>> symptom.  It's really unclear how the former one could arise, because
>>> nothing that vacuum.sql does would change xmin of the rows in pg_am.
>
>> It probably changes the *relfilenode* of pg_am, because it runs VACUUM
>> FULL on that catalog.  Perhaps some backend sees the old relfilenode
>> value and tries to do a heap scan, interpreting the now-truncated file as
>> empty?
>
> Yeah, I came up with the same theory a few minutes later.  Trying to
> reproduce on that basis.
>
> Actually, now that I think it through, the "could not open relation"
> error is pretty odd in itself.  If we are trying to open pg_am using
> a stale catalog snapshot, it seems like we ought to reliably find its
> old pg_class tuple (the one with the obsolete relfilenode), rather than
> finding nothing.  But the latter is the behavior I'm seeing.

What's to stop the old tuple from being HOT-pruned?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something is rotten in the state of Denmark...

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Apr 2, 2015 at 2:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Actually, now that I think it through, the "could not open relation"
>> error is pretty odd in itself.  If we are trying to open pg_am using
>> a stale catalog snapshot, it seems like we ought to reliably find its
>> old pg_class tuple (the one with the obsolete relfilenode), rather than
>> finding nothing.  But the latter is the behavior I'm seeing.

> What's to stop the old tuple from being HOT-pruned?

Hm, that may be it.  I went back to the previous test scenario, and now
I can *only* get the "cache lookup failed for access method" behavior,
instead of what I was getting before, so I'm getting a bit confused :-(.
However, it does seem clear that the mechanism is indeed that we're
relying on an obsolete copy of pg_am's pg_class tuple, hence scanning a
truncated relfilenode, and that the patch I proposed fixes it.

Perhaps the difference has to do with whether pg_am's pg_class tuple is
on a page that hasn't got enough room for a HOT update?  But I definitely
tried it several times and consistently got the same failure before.
        regards, tom lane



Re: Something is rotten in the state of Denmark...

From
Robert Haas
Date:
On Thu, Apr 2, 2015 at 3:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Apr 2, 2015 at 2:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Actually, now that I think it through, the "could not open relation"
>>> error is pretty odd in itself.  If we are trying to open pg_am using
>>> a stale catalog snapshot, it seems like we ought to reliably find its
>>> old pg_class tuple (the one with the obsolete relfilenode), rather than
>>> finding nothing.  But the latter is the behavior I'm seeing.
>
>> What's to stop the old tuple from being HOT-pruned?
>
> Hm, that may be it.  I went back to the previous test scenario, and now
> I can *only* get the "cache lookup failed for access method" behavior,
> instead of what I was getting before, so I'm getting a bit confused :-(.
> However, it does seem clear that the mechanism is indeed that we're
> relying on an obsolete copy of pg_am's pg_class tuple, hence scanning a
> truncated relfilenode, and that the patch I proposed fixes it.
>
> Perhaps the difference has to do with whether pg_am's pg_class tuple is
> on a page that hasn't got enough room for a HOT update?  But I definitely
> tried it several times and consistently got the same failure before.

That seems plausible, except that I have no idea why that would vary
from one test setup to another.  I suggest pushing the patch you
proposed upthread and seeing what the buildfarm thinks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something is rotten in the state of Denmark...

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Apr 2, 2015 at 3:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Perhaps the difference has to do with whether pg_am's pg_class tuple is
>> on a page that hasn't got enough room for a HOT update?  But I definitely
>> tried it several times and consistently got the same failure before.

> That seems plausible, except that I have no idea why that would vary
> from one test setup to another.  I suggest pushing the patch you
> proposed upthread and seeing what the buildfarm thinks.

Yeah.  I have errands to do right now but will push it later.
        regards, tom lane