Discussion: We're leaking predicate locks in HEAD
After running the core regression tests with installcheck-parallel,
the pg_locks view sometimes shows me apparently-orphaned SIReadLock
entries. They accumulate over repeated test runs. Right now, for
example, I see

regression=# select * from pg_locks;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |      mode       | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+-----------------+---------+----------
 relation   |   130144 |    12137 |      |       |            |               |         |       |          | 3/7977             | 8924 | AccessShareLock | t       | t
 virtualxid |          |          |      |       | 3/7977     |               |         |       |          | 3/7977             | 8924 | ExclusiveLock   | t       | t
 relation   |   130144 |   136814 |      |       |            |               |         |       |          | 22/536             | 8076 | SIReadLock      | t       | f
 relation   |   111195 |   118048 |      |       |            |               |         |       |          | 19/665             | 6738 | SIReadLock      | t       | f
 relation   |   130144 |   134850 |      |       |            |               |         |       |          | 12/3093            | 7984 | SIReadLock      | t       | f
(5 rows)

after having done a couple of installcheck iterations since starting
the postmaster.

The PIDs shown as holding those locks don't exist anymore, but digging
in the postmaster log shows that they were session backends during the
regression test runs. Furthermore, it seems like they usually were the
ones running either the triggers or portals tests.

I don't see this behavior in v11 (though maybe I just didn't run it
long enough). In HEAD, a run adds one or two new entries more often
than not.

This is a pretty bad bug IMO --- quite aside from any ill effects of
the entries themselves, the leak seems fast enough that it'd run a
production installation out of locktable space before very long.

I'd have to say that my first suspicion falls on bb16aba50 ...

			regards, tom lane
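[Not part of the original mail, but for anyone wanting to reproduce the observation: a query along the following lines separates SIReadLock entries held by vanished backends from locks held by live sessions. The column choice is illustrative; only pg_locks and pg_stat_activity from the thread are assumed.]

```sql
-- Diagnostic sketch: SIReadLock entries whose holding pid no longer
-- matches any live backend.  Note that SIReadLocks can legitimately
-- outlive their backend until SxactGlobalXmin advances, so a hit here
-- is only suspicious if it persists and accumulates across test runs.
SELECT l.locktype, l.database, l.relation, l.pid, l.granted
FROM pg_locks l
LEFT JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.mode = 'SIReadLock'
  AND a.pid IS NULL;
```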
On Wed, May 8, 2019 at 5:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After running the core regression tests with installcheck-parallel,
> the pg_locks view sometimes shows me apparently-orphaned SIReadLock
> entries. [...]

Ugh.

> I'd have to say that my first suspicion falls on bb16aba50 ...

Investigating.

--
Thomas Munro
https://enterprisedb.com
On 5/7/19 1:46 PM, Tom Lane wrote:
> After running the core regression tests with installcheck-parallel,
> the pg_locks view sometimes shows me apparently-orphaned SIReadLock
> entries. They accumulate over repeated test runs.

Should we have a test for that run at/near the end of the regression
tests? The buildfarm will actually do multiple runs like this if set up
to do parallel checks and test multiple locales.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 5/7/19 1:46 PM, Tom Lane wrote:
>> After running the core regression tests with installcheck-parallel,
>> the pg_locks view sometimes shows me apparently-orphaned SIReadLock
>> entries. They accumulate over repeated test runs.

> Should we have a test for that run at/near the end of the regression
> tests? The buildfarm will actually do multiple runs like this if set up
> to do parallel checks and test multiple locales.

No, I'm not excited about that idea; I think it'd have all the same
fragility as the late lamented "REINDEX pg_class" test. A given test
script has no business assuming that other test scripts aren't
legitimately taking out predicate locks, nor assuming that prior test
scripts are fully cleaned up when it runs.

			regards, tom lane
On Wed, May 8, 2019 at 6:56 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, May 8, 2019 at 5:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'd have to say that my first suspicion falls on bb16aba50 ...
>
> Investigating.

Reproduced here. Once the system reaches a state where it's leaking
(which happens only occasionally for me during installcheck-parallel),
it keeps leaking for future SSI transactions. The cause is
SxactGlobalXmin getting stuck. The attached fixes it for me. I can't
remember why on earth I made that change, but it is quite clearly
wrong: you have to check every transaction, or you might never advance
SxactGlobalXmin.

--
Thomas Munro
https://enterprisedb.com
Attachments
Thomas Munro <thomas.munro@gmail.com> writes:
> Reproduced here. Once the system reaches a state where it's leaking
> (which happens only occasionally for me during installcheck-parallel),
> it keeps leaking for future SSI transactions. The cause is
> SxactGlobalXmin getting stuck. The attached fixes it for me. I can't
> remember why on earth I made that change, but it is quite clearly
> wrong: you have to check every transaction, or you might never advance
> SxactGlobalXmin.

Hm. So I don't have any opinion about whether this is a correct fix for
the leak, but I am quite distressed that the system failed to notice that
it was leaking predicate locks. Shouldn't there be the same sort of
leak-detection infrastructure that we have for most types of resources?

			regards, tom lane
On Wed, May 8, 2019 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Reproduced here. Once the system reaches a state where it's leaking
> > (which happens only occasionally for me during installcheck-parallel),
> > it keeps leaking for future SSI transactions. The cause is
> > SxactGlobalXmin getting stuck. The attached fixes it for me. I can't
> > remember why on earth I made that change, but it is quite clearly
> > wrong: you have to check every transaction, or you might never advance
> > SxactGlobalXmin.
>
> Hm. So I don't have any opinion about whether this is a correct fix for
> the leak, but I am quite distressed that the system failed to notice that
> it was leaking predicate locks. Shouldn't there be the same sort of
> leak-detection infrastructure that we have for most types of resources?

Well, it is hooked up to the usual release machinery, because it's in
ReleasePredicateLocks(), which is wired into the RESOURCE_RELEASE_LOCKS
phase of resowner.c. The thing is that lock lifetime is linked to the
last transaction with the oldest known xmin, not the transaction that
created them.

More analysis: Lock clean-up is deferred until "... the last
serializable transaction with the oldest xmin among serializable
transactions completes", but I broke that by excluding read-only
transactions from the check, so SxactGlobalXminCount gets out of sync.
There's a read-only SSI transaction in
src/test/regress/sql/transactions.sql, but I think the reason the
problem manifests only intermittently with installcheck-parallel is
that sometimes the read-only optimisation kicks in (effectively
dropping us to plain old SI because there's no concurrent serializable
activity) and it doesn't take any locks at all, and sometimes the
read-only transaction doesn't have the oldest known xmin among
serializable transactions.

However, if a read-write SSI transaction had already taken a snapshot
and has the oldest xmin, and then the read-only one starts with the
same xmin, we get into trouble. When the read-only one releases, we
fail to decrement SxactGlobalXminCount, and then we'll never call
ClearOldPredicateLocks().

--
Thomas Munro
https://enterprisedb.com
On Wed, May 8, 2019 at 4:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, May 8, 2019 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > Reproduced here. Once the system reaches a state where it's leaking
> > > (which happens only occasionally for me during installcheck-parallel),
> > > it keeps leaking for future SSI transactions. The cause is
> > > SxactGlobalXmin getting stuck. The attached fixes it for me. I can't
> > > remember why on earth I made that change, but it is quite clearly
> > > wrong: you have to check every transaction, or you might never advance
> > > SxactGlobalXmin.

I pushed a version of that, thereby reverting the already-analysed
hunk, and also another similar hunk (probably harmless).

The second hunk dates from a time in development when I was treating
the final clean-up at commit time as a regular commit, but that failed
in PreCommit_CheckForSerializationFailure() because the DOOMED flag was
set by the earlier RO_SAFE partial release. The change was no longer
necessary, because final release of a partially released read-only
transaction is now done with isCommit forced to false. (Before
bb16aba50, it was done directly at RO_SAFE release time with isCommit
set to false, but bb16aba50 split the operation into two phases,
partial and then final, due to the extended object lifetime requirement
when sharing the SERIALIZABLEXACT with parallel workers.)

I'll update the open items page.

--
Thomas Munro
https://enterprisedb.com