Re: Autovacuum breakage from a734fd5d1

From: Tom Lane
Subject: Re: Autovacuum breakage from a734fd5d1
Message-ID: 13026.1480286736@sss.pgh.pa.us
In response to: Autovacuum breakage from a734fd5d1 (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Autovacuum breakage from a734fd5d1
List: pgsql-hackers

I wrote:
> Buildfarm member skink failed a couple days ago:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2016-11-25%2017%3A50%3A01

Ah ... I can reproduce this with moderate reliability (one failure every
10 or so iterations of the regression tests) by inserting a delay just
before autovacuum's check of orphan status:

*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
*************** do_autovacuum(void)
*** 2046,2051 ****
--- 2046,2053 ----
          {
              int         backendID;
  
+             pg_usleep(100000);
+ 
              backendID = GetTempNamespaceBackendId(classForm->relnamespace);
  
              /* We just ignore it if the owning backend is still active */


I think the sequence of events must be:

1. autovacuum starts its seqscan of pg_class, locking down the snapshot
it's going to use for that.

2. Some backend completes its session and drops some temp table(s).

3. autovacuum's scan arrives at the pg_class entry for one of these
tables.  By now it's committed dead, but it's still visible according
to the seqscan's snapshot, so we make the above test.  Assuming the
owning backend has vacated its sinval slot and no new session has
occupied it, we'll decide the table is orphan and record its OID
for later deletion.

4. The later code that tries to drop the table is able to see that
it's gone by now.  Kaboom.
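
The four steps above can be mimicked with a toy model. This is only a sketch: ToyRel, toy_collect_orphans and toy_drop are hypothetical names invented for illustration, and the boolean flags stand in for snapshot visibility and sinval-slot occupancy rather than reproducing any real PostgreSQL structure or call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a pg_class row (hypothetical, not PostgreSQL's struct). */
typedef struct ToyRel
{
    unsigned    oid;
    bool        visible_in_snapshot;    /* still seen by the scan's snapshot */
    bool        exists_now;             /* false once the owner dropped it */
    bool        owner_active;           /* owning backend still holds its slot */
} ToyRel;

/*
 * Steps 1-3: scan using the stale snapshot.  Any visible entry whose owner
 * is no longer active is recorded as an "orphan", even if it has in fact
 * been dropped already.  Returns the number of OIDs written to out.
 */
size_t
toy_collect_orphans(const ToyRel *rels, size_t n, unsigned *out)
{
    size_t      found = 0;

    assert(rels != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (rels[i].visible_in_snapshot && !rels[i].owner_active)
            out[found++] = rels[i].oid;
    }
    return found;
}

/*
 * Step 4: drop by OID with no recheck.  Returns false when the table is
 * already gone, which models the failure in the later deletion code.
 */
bool
toy_drop(ToyRel *rels, size_t n, unsigned oid)
{
    for (size_t i = 0; i < n; i++)
    {
        if (rels[i].oid == oid)
        {
            if (!rels[i].exists_now)
                return false;       /* already dropped by its owner */
            rels[i].exists_now = false;
            return true;
        }
    }
    return false;
}
```

Feeding this model an entry with visible_in_snapshot set, exists_now clear and owner_active clear makes toy_collect_orphans record it and the subsequent toy_drop fail, which is the shape of the race described above.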

In existing releases, it would be about impossible for this race condition
to persist long enough that we'd actually try to drop the table.  It's
definitely possible that we'd try to print "found orphan temp table",
but guess what: the back-branch coding here is
                ereport(LOG,
                        (errmsg("autovacuum: found orphan temp table \"%s\".\"%s\" in database \"%s\"",
                                get_namespace_name(classForm->relnamespace),
                                NameStr(classForm->relname),
                                get_database_name(MyDatabaseId))));

The only part of that that would be at risk of failure is the
get_namespace_name call, and since we don't ordinarily remove the
pg_namespace entries for temp schemas, it's not likely to fail either.
So the race condition does exist in released branches, but it would not
cause an autovacuum crash even if libc is unforgiving about printf'ing a
NULL string.  At most it would cause bogus log entries claiming that
temp tables have been orphaned when they haven't.

I went digging in the buildfarm logs and was able to find one single
instance of a "found orphan temp table" log entry that couldn't be
blamed on a prior backend crash; it's in this report:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tick&dt=2015-07-29%2003%3A37%3A52

So the problem is confirmed to exist in back branches, but with low
probability and low consequences.  I think we only need to fix it in
HEAD.  The lock acquisition and status recheck that I proposed before
should be sufficient.
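
The general lock-then-recheck pattern can be sketched as below. The details of the actual proposal are not given in this message, so everything here is an assumption: ToyTable, DropResult and toy_drop_orphan_with_recheck are hypothetical names, the "lock" is a plain flag, and the three recheck conditions are a guess at what a status recheck might cover.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy table entry (hypothetical; not a PostgreSQL structure). */
typedef struct ToyTable
{
    unsigned    oid;
    bool        exists;         /* false once dropped */
    bool        is_temp;
    bool        owner_active;   /* owning backend is (again) alive */
    bool        locked;
} ToyTable;

typedef enum DropResult
{
    DROP_DONE,
    DROP_SKIPPED
} DropResult;

/* Find an entry by OID, or NULL if no such entry. */
static ToyTable *
toy_lookup(ToyTable *tabs, size_t n, unsigned oid)
{
    for (size_t i = 0; i < n; i++)
    {
        if (tabs[i].oid == oid)
            return &tabs[i];
    }
    return NULL;
}

/*
 * Take the lock first, then re-verify everything that was decided earlier
 * under the stale snapshot.  If any condition no longer holds, skip the
 * entry quietly instead of failing.
 */
DropResult
toy_drop_orphan_with_recheck(ToyTable *tabs, size_t n, unsigned oid)
{
    ToyTable   *t = toy_lookup(tabs, n, oid);

    if (t == NULL)
        return DROP_SKIPPED;

    t->locked = true;           /* assumed lock acquisition */

    /* Recheck under the lock: assumed conditions, see lead-in. */
    if (!t->exists || !t->is_temp || t->owner_active)
    {
        t->locked = false;
        return DROP_SKIPPED;
    }

    t->exists = false;          /* the actual drop */
    t->locked = false;
    return DROP_DONE;
}
```

With this shape, an entry that vanished between the scan and the drop attempt yields DROP_SKIPPED rather than an error.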
        regards, tom lane


