Re: Full page images in WAL & Cache Invalidation
From | Florian G. Pflug
Subject | Re: Full page images in WAL & Cache Invalidation
Date |
Msg-id | 46A4ADD0.8040500@phlo.org
In reply to | Re: Full page images in WAL & Cache Invalidation (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: Full page images in WAL & Cache Invalidation (Tom Lane <tgl@sss.pgh.pa.us>)
List | pgsql-hackers
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>>> Anyway, if you believe that DDL is infrequent, why are you resistant
>>> to the idea of WAL-logging cache flushes?
>
>> First, cache invalidations are not the only problem caused by replaying
>> system-table updates. The whole SnapshotNow business doesn't exactly
>> make things easier either. So it feels like a lot of added complexity
>> and code for little gain - unless a *lot* more things (like locking
>> requests) are logged too.
>
> The mention of locking requests brought to mind the following
> gedankenexperiment:
>
> 1. slave server backend is running some long-running query on table X.
>
> 2. WAL-reading process receives and executes DROP TABLE X.
>
> (It doesn't even have to be a DROP; most varieties of ALTER are enough
> to create problems for a concurrently-running query.)
>
> It's really hard to see how to defend against that without a fairly
> complete simulation of locking on the slave side.

Well, locking on the slave is a bit easier than on the master, for two
reasons:

1) Queries running on the slave only lock in AccessShareLock mode -
therefore, you only need to know whether a given mode conflicts with
AccessShareLock - and there seems to be only one that does, namely
AccessExclusiveLock. So we really only need to care about
AccessExclusiveLock locks on the master.

2) As far as I can see, the point of an AccessExclusiveLock is *not*
actually to prevent queries from running while a DDL statement is
*executed*, but rather to prevent queries from running while the
statement is *committed*. This fits the fact that system tables are read
using SnapshotNow (not SnapshotDirty) - while the DDL is running,
everybody happily uses the old information; the trouble only starts
after the commit, because with SnapshotNow you suddenly see the new
state.
I'm not yet 100% sure that (2) holds (with the exception of VACUUM
FULL) - but I'm fairly confident, because if (2) were wrong, then how
would the system survive a crash during the execution of a DDL
statement?

So after a bit more thought (and reading - thanks for all the replies,
guys! They are greatly appreciated), I came up with the following plan
for both inval events and locks:

.) Store two flags in the commit record of a transaction: "transaction
generated inval events" and "transaction held an access exclusive
lock".

.) Upon replay, either block until no transactions are running before
replaying the record (for "transaction held an access exclusive lock"),
or flush the caches after replaying it (for "transaction generated
inval events").

This scheme has two really nice properties:

First, it can be extended fairly easily to store not just a simple flag
but a list of OIDs, and use that to make the flushing and blocking more
fine-grained - without changing the basic way in which things work.

And second, it stores all the needed information in the *commit*
record. That removes problems with transactions that are aborted due to
a crash, where the WAL therefore never records that they were aborted.

VACUUM FULL will need some special care - but at least VACUUM FULL is
already such a disruptive operation that it probably won't surprise
anyone if it's disruptive on the slave too. (And now that CLUSTER is
MVCC-safe from what I read, the use case for VACUUM FULL seems to be
pretty slim.)

The btree metapage caching will need special treatment too - probably
some flags in the WAL records that change the metapage, instructing the
slave to synthesize a suitable inval event.

What stays open is regular VACUUM (and maybe inline vacuuming - is that
part of the latest version of HOT, or did it get removed?). Here, the
key is logging the return value of GetOldestXmin(), I think. Since that
value is what ultimately decides whether a tuple can be killed or not,
having it available on the slave should allow the slave to block replay
until no slave query depends on the tuple anymore.

greetings, Florian Pflug