VACUUM FULL versus relcache init files
From | Tom Lane
---|---
Subject | VACUUM FULL versus relcache init files
Date | |
Msg-id | 13119.1313456464@sss.pgh.pa.us
Replies | Re: VACUUM FULL versus relcache init files (Robert Haas <robertmhaas@gmail.com>)
List | pgsql-hackers
This might be the last bug from my concurrent-vacuum-full testing --- at least, I have no remaining unexplained events from about two full days of running the tests. The ones that are left involve backends randomly failing like this:

    psql: FATAL:  could not read block 0 in file "base/130532080/130545668": read only 0 of 8192 bytes

usually during startup, though I have one example of a backend being repeatedly unable to access pg_proc due to similar errors.

I believe what this traces to is stale relfilenode information taken from relcache init files, which contain precomputed relcache data intended to speed up backend startup. There is a curious little dance between write_relcache_init_file and RelationCacheInitFileInvalidate that is intended to ensure that when a process creates a new relcache init file that is already stale (because someone else invalidated the information concurrently), the bad file will get unlinked and not used. I think I invented that logic, so it's my fault that it doesn't work. It works fine as long as you consider only the two processes directly involved; but a third process can get fooled into using stale data. The scenario requires two successive invalidations, as for example from vacuum full on two system catalogs in a row, plus a stream of incoming new backends. It goes like this:

1. Process A vacuums a system catalog, unlinks init file, sends sinval messages, does second unlink (which does nothing).

2. Process B starts, observes lack of init file, begins to construct a new one. It gets to the end of AcceptInvalidationMessages in write_relcache_init_file. Since it started after A sent the first sinvals, it sees no incoming sinval messages and has no reason to think its new init file isn't good.

3. Meanwhile, Process A vacuums another system catalog, unlinks init file (doing nothing), and finally sends its sinval messages just after B looked for them. Now it will block trying to get RelCacheInitLock, which B holds.

4. Process B renames its already-stale init file into place, then releases RelCacheInitLock.

5. Process A gets the lock and removes the stale init file.

Now process B is okay, because it will see A's second sinvals before it tries to make any use of the relcache data it has. And the stale init file is definitely gone after step 5. However, between steps 4 and 5 there is a window for Process C to start, read the stale init file, and attempt to use it. Since C started after A's second set of sinval messages, it doesn't see them and doesn't know it has stale data.

As far as I can see at the moment, the only way to make this bulletproof is to turn both creation and deletion of the init file into atomic operations that include sinval messaging. What I have in mind is:

Creator: must take RelCacheInitLock, check for incoming invals, rename the new file into place if none, release RelCacheInitLock. (This is the same as what it does now.)

Destroyer: must take RelCacheInitLock, unlink the init file, send its sinvals, release RelCacheInitLock.

This guarantees that we serialize the sending of the sinval messages, so that anyone who sees a bad init file in place *must* see the sinval messages afterwards, so long as they join the sinval messaging ring before looking for the init file (which they do). I don't think it's any worse than the current scheme from a parallelism point of view: the destroyer is holding RelCacheInitLock a bit longer than before, but that should not be a performance-critical situation.

Anybody see a hole in that?

			regards, tom lane
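[Editor's note: the creator/destroyer protocol proposed above can be sketched as a small simulation. This is a hedged illustration in Python, not PostgreSQL's actual C implementation; the names below are stand-ins (a plain threading.Lock for RelCacheInitLock, an integer counter for the sinval message stream, and `creator_write`/`destroyer_invalidate` as hypothetical function names). It shows the key ordering: the destroyer unlinks the file and "sends" its sinvals under one lock, and the creator renames its file into place only if no invals arrived since it started.]

```python
import os
import tempfile
import threading

INIT_FILE = "pg_internal.init"        # same name as the real relcache init file
RelCacheInitLock = threading.Lock()   # stand-in for the real lock
sinval_counter = 0                    # stand-in for the sinval message stream

def destroyer_invalidate():
    """Proposed ordering: unlink the init file and send sinvals as one atomic step."""
    global sinval_counter
    with RelCacheInitLock:
        try:
            os.unlink(INIT_FILE)
        except FileNotFoundError:
            pass                      # second unlink in a row does nothing
        sinval_counter += 1           # "send" sinvals before releasing the lock

def creator_write(observed_sinval):
    """Build the file off to the side, then rename into place only if no invals arrived."""
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as f:
        f.write("precomputed relcache data\n")
    with RelCacheInitLock:
        if sinval_counter == observed_sinval:  # no concurrent invalidation seen
            os.replace(tmp, INIT_FILE)         # atomic rename into place
            return True
        os.unlink(tmp)                         # our file is already stale; discard it
        return False
```

Because the destroyer holds the lock across both the unlink and the sinval send, a process C that starts after the sinvals were sent can no longer observe a stale file: any file it finds in place was installed by a creator that checked for those sinvals under the same lock.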