Re: VACUUM FULL versus relcache init files

From: Robert Haas
Subject: Re: VACUUM FULL versus relcache init files
Date:
Msg-id: CA+TgmoZxvcb9S6ojhfptJxakLwBud3kpAAg=4-UFPdwHRzR73Q@mail.gmail.com
In reply to: VACUUM FULL versus relcache init files (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Mon, Aug 15, 2011 at 9:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This might be the last bug from my concurrent-vacuum-full testing --- at
> least, I have no remaining unexplained events from about two full days
> of running the tests.  The ones that are left involve backends randomly
> failing like this:
>
> psql: FATAL:  could not read block 0 in file "base/130532080/130545668": read only 0 of 8192 bytes
>
> usually during startup, though I have one example of a backend being
> repeatedly unable to access pg_proc due to similar errors.
>
> I believe what this traces to is stale relfilenode information taken
> from relcache init files, which contain precomputed relcache data
> intended to speed up backend startup.  There is a curious little dance
> between write_relcache_init_file and RelationCacheInitFileInvalidate
> that is intended to ensure that when a process creates a new relcache
> init file that is already stale (because someone else invalidated the
> information concurrently), the bad file will get unlinked and not used.
> I think I invented that logic, so it's my fault that it doesn't work.
>
> It works fine as long as you consider only the two processes directly
> involved; but a third process can get fooled into using stale data.
> The scenario requires two successive invalidations, as for example from
> vacuum full on two system catalogs in a row, plus a stream of incoming
> new backends.  It goes like this:
>
> 1. Process A vacuums a system catalog, unlinks init file, sends sinval
> messages, does second unlink (which does nothing).
>
> 2. Process B starts, observes lack of init file, begins to construct
> a new one.  It gets to the end of AcceptInvalidationMessages in
> write_relcache_init_file.  Since it started after A sent the first
> sinvals, it sees no incoming sinval messages and has no reason to think
> its new init file isn't good.
>
> 3. Meanwhile, Process A vacuums another system catalog, unlinks init
> file (doing nothing), and finally sends its sinval messages just after
> B looked for them.  Now it will block trying to get RelCacheInitLock,
> which B holds.
>
> 4. Process B renames its already-stale init file into place, then
> releases RelCacheInitLock.
>
> 5. Process A gets the lock and removes the stale init file.
>
> Now process B is okay, because it will see A's second sinvals before it
> tries to make any use of the relcache data it has.  And the stale init
> file is definitely gone after step 5.
>
> However, between steps 4 and 5 there is a window for Process C to start,
> read the stale init file, and attempt to use it.  Since C started after
> A's second set of sinval messages, it doesn't see them and doesn't know
> it has stale data.
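
[Editorial illustration: a minimal C sketch of the pre-fix ordering that opens
this window. Only RelCacheInitLock and the unlink/sinval sequence come from the
message above; the stub primitives, function names, and file name are
placeholder assumptions, not the actual PostgreSQL routines.]

    #include <stdio.h>

    /* Placeholder stubs standing in for PostgreSQL's LWLock and
     * sinval machinery; these are assumptions for illustration. */
    static void acquire_relcache_init_lock(void) { }
    static void release_relcache_init_lock(void) { }
    static void send_init_file_sinvals(void)     { }

    /* Old destroyer path: the sinvals go out with RelCacheInitLock
     * NOT held, so they can land just after a creator has finished
     * AcceptInvalidationMessages and just before its rename. */
    static void
    destroy_init_file_old(void)
    {
        remove("pg_internal.init");   /* steps 1 and 3: may be a no-op  */
        send_init_file_sinvals();     /* step 3: sent outside the lock  */
        acquire_relcache_init_lock(); /* blocks behind the creator (B)  */
        remove("pg_internal.init");   /* step 5: until this unlink, the
                                       * stale file B installed in step 4
                                       * is visible to new backends (C)  */
        release_relcache_init_lock();
    }
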
>
> As far as I can see at the moment, the only way to make this bulletproof
> is to turn both creation and deletion of the init file into atomic
> operations that include sinval messaging.  What I have in mind is
>
> Creator: must take RelCacheInitLock, check for incoming invals, rename
> the new file into place if none, release RelCacheInitLock.  (This is
> the same as what it does now.)
>
> Destroyer: must take RelCacheInitLock, unlink the init file, send its
> sinvals, release RelCacheInitLock.
>
> This guarantees that we serialize the sending of the sinval messages
> so that anyone who sees a bad init file in place *must* see the sinval
> messages afterwards, so long as they join the sinval messaging ring
> before looking for the init file (which they do).  I don't think it's
> any worse than the current scheme from a parallelism point of view: the
> destroyer is holding RelCacheInitLock a bit longer than before, but that
> should not be a performance critical situation.
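
[Editorial illustration: a compilable C sketch of the proposed
creator/destroyer serialization. The creator logic and the lock name
RelCacheInitLock follow the message; the stub primitives and file names are
placeholder assumptions, and the real write_relcache_init_file /
RelationCacheInitFileInvalidate differ in detail.]

    #include <stdio.h>
    #include <stdbool.h>

    /* Placeholder stubs standing in for PostgreSQL's LWLock and
     * sinval machinery; these are assumptions for illustration. */
    static void acquire_relcache_init_lock(void)  { }
    static void release_relcache_init_lock(void)  { }
    static void send_init_file_sinvals(void)      { }
    static bool invals_received_since_write(void) { return false; }

    /* Creator: unchanged from the current scheme -- install the new
     * file only if no invalidations arrived while writing it. */
    static void
    create_init_file(void)
    {
        acquire_relcache_init_lock();
        if (!invals_received_since_write())
            rename("pg_internal.init.tmp", "pg_internal.init");
        else
            remove("pg_internal.init.tmp");  /* discard the stale file */
        release_relcache_init_lock();
    }

    /* Destroyer: unlink AND send sinvals under the same lock.  Any
     * backend that can still see a stale init file must therefore see
     * the invalidating sinval messages after it, provided it joined
     * the sinval ring before looking for the file. */
    static void
    destroy_init_file(void)
    {
        acquire_relcache_init_lock();
        remove("pg_internal.init");
        send_init_file_sinvals();      /* now inside the lock */
        release_relcache_init_lock();
    }
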
>
> Anybody see a hole in that?

I don't.  Seems more robust than the old way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

