Discussion: We're leaking predicate locks in HEAD


We're leaking predicate locks in HEAD

From
Tom Lane
Date:
After running the core regression tests with installcheck-parallel,
the pg_locks view sometimes shows me apparently-orphaned SIReadLock
entries.  They accumulate over repeated test runs.  Right now,
for example, I see

regression=# select * from pg_locks;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |      mode       | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+-----------------+---------+----------
 relation   |   130144 |    12137 |      |       |            |               |         |       |          | 3/7977             | 8924 | AccessShareLock | t       | t
 virtualxid |          |          |      |       | 3/7977     |               |         |       |          | 3/7977             | 8924 | ExclusiveLock   | t       | t
 relation   |   130144 |   136814 |      |       |            |               |         |       |          | 22/536             | 8076 | SIReadLock      | t       | f
 relation   |   111195 |   118048 |      |       |            |               |         |       |          | 19/665             | 6738 | SIReadLock      | t       | f
 relation   |   130144 |   134850 |      |       |            |               |         |       |          | 12/3093            | 7984 | SIReadLock      | t       | f
(5 rows)

after having done a couple of installcheck iterations since starting the
postmaster.

The PIDs shown as holding those locks don't exist anymore, but digging
in the postmaster log shows that they were session backends during the
regression test runs.  Furthermore, it seems like they usually were the
ones running either the triggers or portals tests.

I don't see this behavior in v11 (though maybe I just didn't run it
long enough).  In HEAD, a run adds one or two new entries more often
than not.

This is a pretty bad bug IMO --- quite aside from any ill effects
of the entries themselves, the leak seems fast enough that it'd run
a production installation out of locktable space before very long.
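(For scale, a back-of-envelope estimate assuming default settings: the
shared predicate lock table has room for
max_pred_locks_per_transaction * (max_connections +
max_prepared_transactions) entries, i.e. 64 * 100 = 6400 by default,
so leaking an entry or two per workload cycle would exhaust it within
a few thousand cycles.)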

I'd have to say that my first suspicion falls on bb16aba50 ...

            regards, tom lane



Re: We're leaking predicate locks in HEAD

From
Thomas Munro
Date:
On Wed, May 8, 2019 at 5:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After running the core regression tests with installcheck-parallel,
> the pg_locks view sometimes shows me apparently-orphaned SIReadLock
> entries. [...]

Ugh.

> I'd have to say that my first suspicion falls on bb16aba50 ...

Investigating.

-- 
Thomas Munro
https://enterprisedb.com



Re: We're leaking predicate locks in HEAD

From
Andrew Dunstan
Date:
On 5/7/19 1:46 PM, Tom Lane wrote:
> After running the core regression tests with installcheck-parallel,
> the pg_locks view sometimes shows me apparently-orphaned SIReadLock
> entries.  They accumulate over repeated test runs.  


Should we have a test for that, run at or near the end of the
regression tests? The buildfarm will actually do multiple runs like
this if it's set up to do parallel checks and test multiple locales.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: We're leaking predicate locks in HEAD

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 5/7/19 1:46 PM, Tom Lane wrote:
>> After running the core regression tests with installcheck-parallel,
>> the pg_locks view sometimes shows me apparently-orphaned SIReadLock
>> entries.  They accumulate over repeated test runs.  

> Should we have a test for that, run at or near the end of the
> regression tests? The buildfarm will actually do multiple runs like
> this if it's set up to do parallel checks and test multiple locales.

No, I'm not excited about that idea; I think it'd have all the same
fragility as the late lamented "REINDEX pg_class" test.  A given test
script has no business assuming that other test scripts aren't
legitimately taking out predicate locks, nor assuming that prior test
scripts are fully cleaned up when it runs.

            regards, tom lane



Re: We're leaking predicate locks in HEAD

From
Thomas Munro
Date:
On Wed, May 8, 2019 at 6:56 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, May 8, 2019 at 5:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'd have to say that my first suspicion falls on bb16aba50 ...
>
> Investigating.

Reproduced here.  Once the system reaches a state where it's leaking
(which happens only occasionally for me during installcheck-parallel),
it keeps leaking for future SSI transactions.  The cause is
SxactGlobalXmin getting stuck.  The attached fixes it for me.  I can't
remember why on earth I made that change, but it is quite clearly
wrong: you have to check every transaction, or you might never advance
SxactGlobalXmin.
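
To make the bookkeeping concrete, here is a minimal standalone sketch
(not the actual predicate.c code; the names merely mirror the real
ones for illustration) of the invariant the reverted hunk broke:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-ins for the shared fields in predicate.c. */
static TransactionId SxactGlobalXmin;       /* oldest xmin among sxacts */
static int           SxactGlobalXminCount;  /* how many sxacts share it */

static void
sxact_release(TransactionId xmin, bool is_read_only)
{
    /*
     * The reverted hunk effectively guarded this test with
     * "if (!is_read_only && ...)", so a read-only transaction sharing
     * the oldest xmin never decremented the count, the count never
     * reached zero, and the deferred lock clean-up never ran again.
     */
    (void) is_read_only;    /* the fix: don't consult this flag here */

    if (xmin == SxactGlobalXmin && --SxactGlobalXminCount == 0)
    {
        /* recompute SxactGlobalXmin and clear old predicate locks */
    }
}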

-- 
Thomas Munro
https://enterprisedb.com

Attachments

Re: We're leaking predicate locks in HEAD

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Reproduced here.  Once the system reaches a state where it's leaking
> (which happens only occasionally for me during installcheck-parallel),
> it keeps leaking for future SSI transactions.  The cause is
> SxactGlobalXmin getting stuck.  The attached fixes it for me.  I can't
> remember why on earth I made that change, but it is quite clearly
> wrong: you have to check every transaction, or you might never advance
> SxactGlobalXmin.

Hm.  So I don't have any opinion about whether this is a correct fix for
the leak, but I am quite distressed that the system failed to notice that
it was leaking predicate locks.  Shouldn't there be the same sort of
leak-detection infrastructure that we have for most types of resources?

            regards, tom lane



Re: We're leaking predicate locks in HEAD

From
Thomas Munro
Date:
On Wed, May 8, 2019 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Reproduced here.  Once the system reaches a state where it's leaking
> > (which happens only occasionally for me during installcheck-parallel),
> > it keeps leaking for future SSI transactions.  The cause is
> > SxactGlobalXmin getting stuck.  The attached fixes it for me.  I can't
> > remember why on earth I made that change, but it is quite clearly
> > wrong: you have to check every transaction, or you might never advance
> > SxactGlobalXmin.
>
> Hm.  So I don't have any opinion about whether this is a correct fix for
> the leak, but I am quite distressed that the system failed to notice that
> it was leaking predicate locks.  Shouldn't there be the same sort of
> leak-detection infrastructure that we have for most types of resources?

Well, it is hooked up to the usual release machinery, because it's in
ReleasePredicateLocks(), which is wired into the
RESOURCE_RELEASE_LOCKS phase of resowner.c.  The thing is that lock
lifetime is linked to the last transaction with the oldest known xmin,
not to the transaction that created the locks.

More analysis:  Lock clean-up is deferred until "... the last
serializable transaction with the oldest xmin among serializable
transactions completes", but I broke that by excluding read-only
transactions from the check, with the result that
SxactGlobalXminCount gets out of sync.  There's a read-only SSI
transaction in src/test/regress/sql/transactions.sql, but I think the
reason the problem manifests only intermittently with
installcheck-parallel is that sometimes the read-only optimisation
kicks in (effectively dropping us to plain old SI because there's no
concurrent serializable activity) so it doesn't take any locks at all,
and sometimes the read-only transaction doesn't have the oldest known
xmin among serializable transactions.  However, if a read-write SSI
transaction has already taken a snapshot and has the oldest xmin, and
then the read-only one starts with the same xmin, we get into trouble.
When the read-only one releases, we fail to decrement
SxactGlobalXminCount, and then we'll never call
ClearOldPredicateLocks().
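
Spelled out as a toy simulation (made-up stand-ins again, not server
code), assuming a read-write and a read-only sxact both start with
xmin = 100:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

static TransactionId SxactGlobalXmin = 100; /* oldest xmin, shared by both */
static int           SxactGlobalXminCount = 2;

/* The broken variant: read-only sxacts skip the bookkeeping entirely. */
static void
buggy_release(TransactionId xmin, bool is_read_only)
{
    if (!is_read_only && xmin == SxactGlobalXmin)
        SxactGlobalXminCount--;
}

int
main(void)
{
    buggy_release(100, true);   /* read-only sxact ends: no decrement */
    assert(SxactGlobalXminCount == 2);

    buggy_release(100, false);  /* read-write sxact ends: 2 -> 1 */
    assert(SxactGlobalXminCount == 1);

    /*
     * No serializable transactions remain, but the count is stuck at 1.
     * It can never reach zero, so ClearOldPredicateLocks() is never
     * called and the SIReadLocks are orphaned, as in pg_locks above.
     */
    return 0;
}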

-- 
Thomas Munro
https://enterprisedb.com



Re: We're leaking predicate locks in HEAD

From
Thomas Munro
Date:
On Wed, May 8, 2019 at 4:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, May 8, 2019 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > Reproduced here.  Once the system reaches a state where it's leaking
> > > (which happens only occasionally for me during installcheck-parallel),
> > > it keeps leaking for future SSI transactions.  The cause is
> > > SxactGlobalXmin getting stuck.  The attached fixes it for me.  I can't
> > > remember why on earth I made that change, but it is quite clearly
> > > wrong: you have to check every transaction, or you might never advance
> > > SxactGlobalXmin.

I pushed a version of that, thereby reverting the already-analysed
hunk along with another similar hunk (probably harmless).

The second hunk dates from a time in development when I was treating
the final clean-up at commit time as a regular commit, but that failed
in PreCommit_CheckForSerializationFailure() because the DOOMED flag
was set by the earlier RO_SAFE partial release.  The change was no
longer necessary, because final release of a partially released
read-only transaction is now done with isCommit forced to false.
(Before bb16aba50, it was done directly at RO_SAFE release time with
isCommit set to false, but bb16aba50 split the operation into two
phases, partial and then final, due to the extended object lifetime
requirement when sharing the SERIALIZABLEXACT with parallel workers.)

I'll update the open items page.

-- 
Thomas Munro
https://enterprisedb.com