Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
От | Tom Lane |
---|---|
Тема | Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes |
Дата | |
Msg-id | 17825.1535556627@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Re: buildfarm: could not read block 3 in file "base/16384/2662":read only 0 of 8192 bytes (Andres Freund <andres@anarazel.de>) |
Ответы |
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
(Tom Lane <tgl@sss.pgh.pa.us>)
Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes (Tom Lane <tgl@sss.pgh.pa.us>) |
Список | pgsql-hackers |
Andres Freund <andres@anarazel.de> writes: > It's not OK to rebuild relcache entries in the middle of > ReceiveSharedInvalidMessages() - a later entry in the invalidation queue > might be relmapper invalidation, and thus immediately processing a > relcache invalidation might attempt to scan a relation that does not > exist anymore. Check, but the bug is actually more subtle than that. I've reproduced the problem using your scripts, changed the ERROR to PANIC, and studied the stack traces of a few core dumps, and here's what I find is the scenario: * Code tries to open some relation or other. After acquiring lock, it goes off to do AcceptInvalidationMessages (cf. LockRelationOid). That reads a relcache inval message for "t", and because "t"'s relcache entry is already open, it decides to rebuild it immediately. * Therefore it calls ScanPgRelation, which needs to do a search using pg_class_oid_index, so it opens and locks that. Again, having acquired the lock in LockRelationOid, it goes off to do AcceptInvalidationMessages (recursively). * Now we read another relcache inval (for "t_pkey", in the cases I've traced through in detail). Again, that's open and we decide we'd better do RelationReloadIndexInfo for it. * We now recursively enter ScanPgRelation, which (again) needs to do a search using pg_class_oid_index, so it (again) opens and locks that. BUT: LockRelationOid sees that *this process already has share lock on pg_class_oid_index*, so it figures it can skip AcceptInvalidationMessages. Had we done AcceptInvalidationMessages at this point, we'd have found the relcache inval message that's waiting for us about pg_class_oid_index, and we'd have processed it before attempting to read pg_class_oid_index, and all would be well. It's the missed AIM call due to the recursive entry to ScanPgRelation that is causing us to fail to read all available inval messages before believing that our info about pg_class_oid_index is up to date. > Instead, receiving a relcache invalidation now just queues an entry onto > a new "pending rebuilds" list, that is then processed in a second stage, > after emptying the entire sinval queue. At that stage we're guaranteed > that the relmapper is up2date. More to the point, we're guaranteed that we've seen all inval messages that were sent before we acquired whatever lock we just acquired. (In the cases I've looked at, the local relmapper *is* up to date, it's the relcache entry for pg_class_oid_index that is not.) My first gut feeling after understanding the recursion aspect was that what you're suggesting would not be enough to be bulletproof. We will still have the hazard that while we're rebuilding relcache entries, we may end up doing recursive ScanPgRelation calls and the inner ones will decide they needn't call AcceptInvalidationMessages. So we might well still be rebuilding stuff without having seen every available inval message. But it should be OK, because any inval messages we could possibly need to worry about will have been scanned and queued by the outer call. We could perhaps fix this with a less invasive change than what you suggest here, by attacking the missed-call-due-to-recursion aspect rather than monkeying with how relcache rebuild itself works. But I'm thinking this way might be a good idea anyhow, in order to reduce the depth of recursion that occurs in scenarios like this. I've not read your patch yet, so no comment on specifics. regards, tom lane
В списке pgsql-hackers по дате отправления:
Следующее
От: Tom LaneДата:
Сообщение: Re: Is child process of postmaster able to access all the databases?