VACUUM FULL versus relcache init files

From Tom Lane
Subject VACUUM FULL versus relcache init files
Msg-id 13119.1313456464@sss.pgh.pa.us
Responses Re: VACUUM FULL versus relcache init files  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
This might be the last bug from my concurrent-vacuum-full testing --- at
least, I have no remaining unexplained events from about two full days
of running the tests.  The ones that are left involve backends randomly
failing like this:

psql: FATAL:  could not read block 0 in file "base/130532080/130545668": read only 0 of 8192 bytes

usually during startup, though I have one example of a backend being
repeatedly unable to access pg_proc due to similar errors.

I believe what this traces to is stale relfilenode information taken
from relcache init files, which contain precomputed relcache data
intended to speed up backend startup.  There is a curious little dance
between write_relcache_init_file and RelationCacheInitFileInvalidate
that is intended to ensure that when a process creates a new relcache
init file that is already stale (because someone else invalidated the
information concurrently), the bad file will get unlinked and not used.
I think I invented that logic, so it's my fault that it doesn't work.

It works fine as long as you consider only the two processes directly
involved; but a third process can get fooled into using stale data.
The scenario requires two successive invalidations, as for example from
vacuum full on two system catalogs in a row, plus a stream of incoming
new backends.  It goes like this:

1. Process A vacuums a system catalog, unlinks init file, sends sinval
messages, does second unlink (which does nothing).

2. Process B starts, observes lack of init file, begins to construct
a new one.  It gets to the end of AcceptInvalidationMessages in
write_relcache_init_file.  Since it started after A sent the first
sinvals, it sees no incoming sinval messages and has no reason to think
its new init file isn't good.

3. Meanwhile, Process A vacuums another system catalog, unlinks init
file (doing nothing), and finally sends its sinval messages just after
B looked for them.  Now it will block trying to get RelCacheInitLock,
which B holds.

4. Process B renames its already-stale init file into place, then
releases RelCacheInitLock.

5. Process A gets the lock and removes the stale init file.

Now process B is okay, because it will see A's second sinvals before it
tries to make any use of the relcache data it has.  And the stale init
file is definitely gone after step 5.

However, between steps 4 and 5 there is a window for Process C to start,
read the stale init file, and attempt to use it.  Since C started after
A's second set of sinval messages, it doesn't see them and doesn't know
it has stale data.
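
To make the window concrete, here is a small straight-line replay of steps 1 to 5 with Process C starting between steps 4 and 5.  It is only a toy model under stated assumptions, not PostgreSQL code: the names Shared, Backend, and replay_race are hypothetical, relfilenodes are plain integers, the sinval queue is a Python list, and RelCacheInitLock is implied by the ordering of the statements rather than modeled.

```python
class Shared:
    """Toy stand-ins for the catalogs, the sinval queue, and the init file."""

    def __init__(self):
        self.relfilenode = {"pg_class": 100, "pg_proc": 200}  # current truth
        self.sinval = []        # append-only invalidation messages
        self.init_file = None   # snapshot of relfilenode data, or None


class Backend:
    def __init__(self, shared):
        self.shared = shared
        # Joining the sinval ring: only messages sent later are ever seen.
        self.read_pos = len(shared.sinval)
        self.cache = None

    def load_or_build(self):
        # Use the init file if one is in place, else read the catalogs.
        if self.shared.init_file is not None:
            self.cache = dict(self.shared.init_file)
        else:
            self.cache = dict(self.shared.relfilenode)

    def accept_invalidations(self):
        pending = self.shared.sinval[self.read_pos:]
        self.read_pos = len(self.shared.sinval)
        if pending:
            # Any message forces a rebuild from the catalogs in this model.
            self.cache = dict(self.shared.relfilenode)
        return pending


def replay_race():
    s = Shared()

    # Step 1: A vacuums pg_class, unlinks, sends sinval, unlinks again.
    s.relfilenode["pg_class"] += 1
    s.init_file = None
    s.sinval.append("pg_class")
    s.init_file = None

    # Step 2: B starts, finds no init file, builds one, sees no sinvals.
    b = Backend(s)
    b.load_or_build()
    b_saw = b.accept_invalidations()

    # Step 3: A vacuums pg_proc, unlinks (nothing there), sends sinval.
    s.relfilenode["pg_proc"] += 1
    s.init_file = None
    s.sinval.append("pg_proc")

    # Step 4: B renames its already-stale file into place.
    if not b_saw:
        s.init_file = dict(b.cache)

    # Window: C starts after A's second sinval and reads the stale file.
    c = Backend(s)
    c.load_or_build()
    c.accept_invalidations()

    # Step 5: A gets the lock and removes the stale file.
    s.init_file = None

    # B notices A's second sinval before it uses its cache again.
    b.accept_invalidations()

    return {
        "b_ok": b.cache == s.relfilenode,
        "c_ok": c.cache == s.relfilenode,
        "init_file": s.init_file,
    }
```

In this model B ends up with correct data and the init file is gone, but C is left holding the old pg_proc relfilenode with nothing queued to tell it so.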

As far as I can see at the moment, the only way to make this bulletproof
is to turn both creation and deletion of the init file into atomic
operations that include sinval messaging.  What I have in mind is

Creator: must take RelCacheInitLock, check for incoming invals, rename
the new file into place if none, release RelCacheInitLock.  (This is
the same as what it does now.)

Destroyer: must take RelCacheInitLock, unlink the init file, send its
sinvals, release RelCacheInitLock.

This guarantees that we serialize the sending of the sinval messages
so that anyone who sees a bad init file in place *must* see the sinval
messages afterwards, so long as they join the sinval messaging ring
before looking for the init file (which they do).  I don't think it's
any worse than the current scheme from a parallelism point of view: the
destroyer is holding RelCacheInitLock a bit longer than before, but that
should not be a performance critical situation.
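
As a sanity check on that argument, the sketch below enumerates every interleaving of a destroyer (A), a creator (B), and a newly started backend (C) in a toy model, and counts the schedules in which C ends up with stale data.  Everything here is an assumption made for illustration: the step lists, the function names, and the simplification that B's snapshot is already out of date when the schedule begins.  It models only one invalidation and one creator, so it is evidence for the reasoning above rather than a proof about the real code.

```python
B_STEPS = [("B", "lock"), ("B", "check"), ("B", "rename"), ("B", "unlock")]
C_STEPS = [("C", "join"), ("C", "load")]

# Destroyer as described for the current code: unlink, send, locked unlink.
A_CURRENT = [("A", "unlink"), ("A", "send"),
             ("A", "lock"), ("A", "unlink"), ("A", "unlock")]
# Destroyer as proposed: unlink and send inside one locked section.
A_PROPOSED = [("A", "lock"), ("A", "unlink"), ("A", "send"), ("A", "unlock")]


def interleavings(seqs):
    """Yield every merge of the step lists that keeps each list's order."""
    seqs = [s for s in seqs if s]
    if not seqs:
        yield []
        return
    for i, s in enumerate(seqs):
        rest = seqs[:i] + [s[1:]] + seqs[i + 1:]
        for tail in interleavings(rest):
            yield [s[0]] + tail


def c_ends_stale(schedule):
    """Replay one schedule; None if it violates the lock, else True/False."""
    truth, b_snapshot = 2, 1   # A already changed the catalog; B read it earlier
    queue, init_file, holder = [], None, None
    b_ok = False
    c_pos, c_cache = 0, None
    for proc, action in schedule:
        if action == "lock":
            if holder is not None:
                return None    # would block here, so this order cannot occur
            holder = proc
        elif action == "unlock":
            holder = None
        elif action == "unlink":
            init_file = None
        elif action == "send":
            queue.append("inval")
        elif action == "check":
            b_ok = not queue   # B joined the ring while the queue was empty
        elif action == "rename":
            if b_ok:
                init_file = b_snapshot
        elif action == "join":
            c_pos = len(queue)
        elif action == "load":
            c_cache = init_file if init_file is not None else truth
    if queue[c_pos:]:
        c_cache = truth        # C processes pending sinvals before using data
    return c_cache != truth


def check(a_steps):
    """Return (reachable schedules, schedules where C ends up stale)."""
    results = [c_ends_stale(s)
               for s in interleavings([a_steps, B_STEPS, C_STEPS])]
    reachable = [r for r in results if r is not None]
    return len(reachable), sum(1 for r in reachable if r)
```

Calling check(A_CURRENT) finds schedules that leave C stale, while check(A_PROPOSED) finds none.  Within the locked section the order matters: if the model sent the sinval before unlinking, C could join after the send and still load the file before it disappears.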

Anybody see a hole in that?
        regards, tom lane

