Discussion: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Robert Haas
Date:
On Thu, Feb 11, 2016 at 9:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <rhaas@postgresql.org> writes:
>> Add some isolation tests for deadlock detection and resolution.
>
> Buildfarm says this needs work ...
>
> dromedary is one of mine, do you need me to look into what is
> happening?

That would be great.  Taking a look at what happened, I have a feeling
this may be a race condition of some kind in the isolation tester.  It
seems to have failed to recognize that a1 started waiting, and that
caused the "deadlock detected" message to be reported differently.  I'm
not immediately sure what to do about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Robert Haas
Date:
On Thu, Feb 11, 2016 at 9:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 11, 2016 at 9:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <rhaas@postgresql.org> writes:
>>> Add some isolation tests for deadlock detection and resolution.
>>
>> Buildfarm says this needs work ...
>>
>> dromedary is one of mine, do you need me to look into what is
>> happening?
>
> That would be great.  Taking a look at what happened, I have a feeling
> this may be a race condition of some kind in the isolation tester.  It
> seems to have failed to recognize that a1 started waiting, and that
> caused the "deadlock detected" message to be reported differently.  I'm
> not immediately sure what to do about that.

Yeah, so: try_complete_step() waits 10ms, and if it still hasn't
gotten any data back from the server, then it uses a separate query to
see whether the step in question is waiting on a lock.  So what
must've happened here is that it took more than 10ms for the process
to show up as waiting in pg_stat_activity.
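
For concreteness, the probe amounts to something like this (a simplified
sketch, not the exact prepared query isolationtester builds; its real query,
visible in the patch later in this thread, joins pg_locks against itself,
and the PID here is a placeholder):

-- is backend 12345 (placeholder PID) waiting on an ungranted heavyweight lock?
SELECT 1 FROM pg_locks WHERE pid = 12345 AND NOT granted;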

It might be possible to fix this by not passing STEP_NONBLOCK if
there's only one connection that isn't waiting.  I think I had it like
that at one point, and then took it out because it caused some other
problem.  Another option is to lengthen the timeout.  It doesn't seem
great to be dependent on a fixed timeout, but the server doesn't send
any protocol traffic to indicate a lock wait.  If we declared which
steps are supposed to wait, then there'd be no ambiguity, but that
seems like a drag.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
>> That would be great.  Taking a look at what happened, I have a feeling
>> this may be a race condition of some kind in the isolation tester.  It
>> seems to have failed to recognize that a1 started waiting, and that
>> caused the "deadlock detected" message to be reported differently.  I'm
>> not immediately sure what to do about that.

> Yeah, so: try_complete_step() waits 10ms, and if it still hasn't
> gotten any data back from the server, then it uses a separate query to
> see whether the step in question is waiting on a lock.  So what
> must've happened here is that it took more than 10ms for the process
> to show up as waiting in pg_stat_activity.

No, because the machines that are failing are showing a "<waiting ...>"
annotation that your reference output *doesn't* have.  I think what is
actually happening is that these machines are seeing the process as
waiting and reporting it, whereas on your machine the backend detects
the deadlock and completes the query (with an error) before
isolationtester realizes that the process is waiting.

It would probably help if you didn't do this:

setup        { BEGIN; SET deadlock_timeout = '10ms'; }

which pretty much guarantees a race condition: you've set it so that the
deadlock detector will run at approximately the same time as
isolationtester is probing the state.  I'm surprised that it seemed
to act consistently for you.  I would suggest setting all the other
sessions' deadlock_timeout to 100s, and that of the one you want to fail
to ~5s.  That will mean that the "<waiting ...>" output should
show up pretty reliably even on overloaded buildfarm critters.
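
In spec-file terms, that asymmetry would look something like this (a sketch
following the syntax quoted above; the values are just the ones suggested
here):

session "s1"    # likewise for the other sessions that must stay blocked
setup { BEGIN; SET deadlock_timeout = '100s'; }

session "s8"    # the one session meant to be chosen as the deadlock victim
setup { BEGIN; SET deadlock_timeout = '5s'; }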
        regards, tom lane



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Tom Lane
Date:
I wrote:
> No, because the machines that are failing are showing a "<waiting ...>"
> annotation that your reference output *doesn't* have.  I think what is
> actually happening is that these machines are seeing the process as
> waiting and reporting it, whereas on your machine the backend detects
> the deadlock and completes the query (with an error) before
> isolationtester realizes that the process is waiting.

I confirmed this theory by the expedient of changing the '10ms' setting
in the test script to 1ms (which worked) and 100ms (which did not, on
the same machine).

I've committed an update that adjusts the timeouts to hopefully ensure
that isolationtester always sees the query as blocked before the deadlock
detector unblocks it; which seems like the behavior we want to test for,
anyway.
        regards, tom lane



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Robert Haas
Date:
On Thu, Feb 11, 2016 at 12:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> No, because the machines that are failing are showing a "<waiting ...>"
>> annotation that your reference output *doesn't* have.  I think what is
>> actually happening is that these machines are seeing the process as
>> waiting and reporting it, whereas on your machine the backend detects
>> the deadlock and completes the query (with an error) before
>> isolationtester realizes that the process is waiting.
>
> I confirmed this theory by the expedient of changing the '10ms' setting
> in the test script to 1ms (which worked) and 100ms (which did not, on
> the same machine).
>
> I've committed an update that adjusts the timeouts to hopefully ensure
> that isolationtester always sees the query as blocked before the deadlock
> detector unblocks it; which seems like the behavior we want to test for,
> anyway.

Thanks.  I really appreciate it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Tom Lane
Date:
We're not out of the woods on this :-( ... jaguarundi, which is the first
of the CLOBBER_CACHE_ALWAYS animals to run these tests, didn't like them
at all.  I think I fixed the deadlock-soft-2 failure, but its take on
deadlock-hard is:

*** 17,25 ****
  step s6a7: LOCK TABLE a7; <waiting ...>
  step s7a8: LOCK TABLE a8; <waiting ...>
  step s8a1: LOCK TABLE a1; <waiting ...>
- step s8a1: <... completed>
  step s7a8: <... completed>
! error in steps s8a1 s7a8: ERROR:  deadlock detected
  step s8c: COMMIT;
  step s7c: COMMIT;
  step s6a7: <... completed>
--- 17,25 ----
  step s6a7: LOCK TABLE a7; <waiting ...>
  step s7a8: LOCK TABLE a8; <waiting ...>
  step s8a1: LOCK TABLE a1; <waiting ...>
  step s7a8: <... completed>
! step s8a1: <... completed>
! ERROR:  deadlock detected
  step s8c: COMMIT;
  step s7c: COMMIT;
  step s6a7: <... completed>

The problem here is that when the deadlock detector kills s8's
transaction, s7a8 is also left free to proceed, so there is a race
condition as to which query completion will get back to
isolationtester first.

One grotty way to handle that would be something like

-step "s7a8"    { LOCK TABLE a8; }
+step "s7a8"    { LOCK TABLE a8; SELECT pg_sleep(5); }

Or we could simplify the locking structure enough so that no other
transactions are released by the deadlock failure.  I do not know
exactly what you had in mind to be testing here?
        regards, tom lane



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Robert Haas
Date:
On Thu, Feb 11, 2016 at 11:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> We're not out of the woods on this :-( ... jaguarundi, which is the first
> of the CLOBBER_CACHE_ALWAYS animals to run these tests, didn't like them
> at all.  I think I fixed the deadlock-soft-2 failure, but its take on
> deadlock-hard is:
>
> *** 17,25 ****
>   step s6a7: LOCK TABLE a7; <waiting ...>
>   step s7a8: LOCK TABLE a8; <waiting ...>
>   step s8a1: LOCK TABLE a1; <waiting ...>
> - step s8a1: <... completed>
>   step s7a8: <... completed>
> ! error in steps s8a1 s7a8: ERROR:  deadlock detected
>   step s8c: COMMIT;
>   step s7c: COMMIT;
>   step s6a7: <... completed>
> --- 17,25 ----
>   step s6a7: LOCK TABLE a7; <waiting ...>
>   step s7a8: LOCK TABLE a8; <waiting ...>
>   step s8a1: LOCK TABLE a1; <waiting ...>
>   step s7a8: <... completed>
> ! step s8a1: <... completed>
> ! ERROR:  deadlock detected
>   step s8c: COMMIT;
>   step s7c: COMMIT;
>   step s6a7: <... completed>
>
> The problem here is that when the deadlock detector kills s8's
> transaction, s7a8 is also left free to proceed, so there is a race
> condition as to which query completion will get back to
> isolationtester first.
>
> One grotty way to handle that would be something like
>
> -step "s7a8"    { LOCK TABLE a8; }
> +step "s7a8"    { LOCK TABLE a8; SELECT pg_sleep(5); }
>
> Or we could simplify the locking structure enough so that no other
> transactions are released by the deadlock failure.  I do not know
> exactly what you had in mind to be testing here?

Basically just that the deadlock actually got detected.   That may
sound a bit weak, but considering we had no test for it at all before
this...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 11, 2016 at 11:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The problem here is that when the deadlock detector kills s8's
>> transaction, s7a8 is also left free to proceed, so there is a race
>> condition as to which query completion will get back to
>> isolationtester first.
>> 
>> One grotty way to handle that would be something like
>> 
>> -step "s7a8"    { LOCK TABLE a8; }
>> +step "s7a8"    { LOCK TABLE a8; SELECT pg_sleep(5); }
>> 
>> Or we could simplify the locking structure enough so that no other
>> transactions are released by the deadlock failure.  I do not know
>> exactly what you had in mind to be testing here?

> Basically just that the deadlock actually got detected.   That may
> sound a bit weak, but considering we had no test for it at all before
> this...

I tried fixing it as shown above, and was dismayed to find out that
it didn't work, ie, there was still a difference between the regular
output and the results with CLOBBER_CACHE_ALWAYS.  In the latter case
the printout makes it appear that s7a8 completed before s8a1, which
is nonsensical.

Investigation showed that there are a couple of reasons.  One,
isolationtester's is-it-waiting query takes an insane amount of
time under CLOBBER_CACHE_ALWAYS --- over half a second on my
reasonably new server.  Probing the state of half a dozen blocked
sessions thus takes a while.  Second, once s8 has been booted out
of its transaction, s7 is no longer "blocked" according to
isolationtester's definition (it's doing the pg_sleep query
instead).  Therefore, when we're rechecking all the other blocked
steps after detecting that s8 has become blocked, two things
happen: enough time elapses for the deadlock detector to fire,
and then when we get around to checking s7, we don't see it as
blocked and therefore wait until it finishes.  So s7a8 is reported
first despite the pg_sleep, and would be no matter how large a pg_sleep
delay is used.

We could possibly fix this by using a deadlock timeout even higher than
5 seconds, but that way madness lies.

Instead, what I propose we do about this is to change isolationtester
so that once it's decided that a given step is blocked, it no longer
issues the is-it-waiting query for that step; it just assumes that the
step should be treated as blocked.  So all we need do for "backlogged"
steps is check PQisBusy/PQconsumeInput.  That both greatly reduces the
number of is-it-waiting queries that are needed and avoids any flappy
behavior of the answer.
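
In libpq terms, polling an already-blocked step reduces to something like
this (a minimal sketch of the idea, not the committed isolationtester code):

#include <stdio.h>
#include <libpq-fe.h>

/* Sketch: poll a step already known to be blocked, without re-issuing the
 * is-it-waiting query.  Returns a result once the backend finishes (e.g.
 * after the deadlock detector cancels it), or NULL while it's still busy. */
static PGresult *
poll_blocked_step(PGconn *conn)
{
    if (!PQconsumeInput(conn))      /* absorb any pending server data */
    {
        fprintf(stderr, "connection failure: %s", PQerrorMessage(conn));
        return NULL;
    }
    if (PQisBusy(conn))             /* query still in progress */
        return NULL;
    return PQgetResult(conn);       /* complete result is available */
}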

Comments?
        regards, tom lane



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Tom Lane
Date:
I wrote:
> Instead, what I propose we do about this is to change isolationtester
> so that once it's decided that a given step is blocked, it no longer
> issues the is-it-waiting query for that step; it just assumes that the
> step should be treated as blocked.  So all we need do for "backlogged"
> steps is check PQisBusy/PQconsumeInput.  That both greatly reduces the
> number of is-it-waiting queries that are needed and avoids any flappy
> behavior of the answer.

Hmm, that seemed to work fine here, but the buildfarm is not very happy
with it, and on reflection I guess it's just moving the indeterminacy
somewhere else.  If we check for completion of a given step, and don't
wait till it's either completed or known blocked, then we have a race
condition that can change the order in which completions are reported.

The fundamental problem I fear is that isolationtester is designed around
the assumption that only its own actions (query issuances) change the
state of the database.  Trying to use it to test deadlock detection is
problematic because deadlock-breaking changes the DB state asynchronously.

I think what we have to do is revert that change and dumb down
deadlock-hard until it produces stable results even on the CLOBBER
critters.  One thing that'd help is reducing the number of processes
involved --- AFAICS, testing an 8-way deadlock is not really any more
interesting than testing, say, 4-way, and that would halve the amount
of time isolationtester spends figuring out the state.
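
For illustration, the shape of a reduced spec would be something like this
(a sketch only, shown with two sessions for brevity; table and step names
are illustrative, not the committed deadlock-hard test):

setup     { CREATE TABLE a1 (); CREATE TABLE a2 (); }
teardown  { DROP TABLE a1; DROP TABLE a2; }

session "s1"
setup       { BEGIN; SET deadlock_timeout = '100s'; }
step "s1a1" { LOCK TABLE a1; }
step "s1a2" { LOCK TABLE a2; }
step "s1c"  { COMMIT; }

session "s2"
setup       { BEGIN; SET deadlock_timeout = '5s'; }
step "s2a2" { LOCK TABLE a2; }
step "s2a1" { LOCK TABLE a1; }
step "s2c"  { COMMIT; }

permutation "s1a1" "s2a2" "s1a2" "s2a1" "s1c" "s2c"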
        regards, tom lane



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Robert Haas
Date:
On Fri, Feb 12, 2016 at 4:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Instead, what I propose we do about this is to change isolationtester
>> so that once it's decided that a given step is blocked, it no longer
>> issues the is-it-waiting query for that step; it just assumes that the
>> step should be treated as blocked.  So all we need do for "backlogged"
>> steps is check PQisBusy/PQconsumeInput.  That both greatly reduces the
>> number of is-it-waiting queries that are needed and avoids any flappy
>> behavior of the answer.
>
> Hmm, that seemed to work fine here, but the buildfarm is not very happy
> with it, and on reflection I guess it's just moving the indeterminacy
> somewhere else.  If we check for completion of a given step, and don't
> wait till it's either completed or known blocked, then we have a race
> condition that can change the order in which completions are reported.
>
> The fundamental problem I fear is that isolationtester is designed around
> the assumption that only its own actions (query issuances) change the
> state of the database.  Trying to use it to test deadlock detection is
> problematic because deadlock-breaking changes the DB state asynchronously.
>
> I think what we have to do is revert that change and dumb down
> deadlock-hard until it produces stable results even on the CLOBBER
> critters.  One thing that'd help is reducing the number of processes
> involved --- AFAICS, testing an 8-way deadlock is not really any more
> interesting than testing, say, 4-way, and that would halve the amount
> of time isolationtester spends figuring out the state.

Maybe we should introduce a way to declare whether a step is expected
to wait or not.  I thought about doing that, and the only reason I
didn't is because I couldn't figure out a reasonable syntax.  But, in
many respects, that would actually be better than the current system
of having isolationtester try to figure it out itself.
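
For instance (purely hypothetical syntax, nothing that exists in the spec
grammar today):

step "s8a1" (waits) { LOCK TABLE a1; }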

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Andres Freund
Date:
On 2016-02-12 13:16:54 -0500, Tom Lane wrote:
> Investigation showed that there are a couple of reasons.  One,
> isolationtester's is-it-waiting query takes an insane amount of
> time under CLOBBER_CACHE_ALWAYS --- over half a second on my
> reasonably new server.

I wonder if we shouldn't just expose a 'which pid is process X waiting
for' API, implemented serverside. That's generally really useful, and
looks like it's actually going to be less complicated than that
query... And it's surely going to be faster.

Andres



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Robert Haas
Date:
On Fri, Feb 12, 2016 at 6:22 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-12 13:16:54 -0500, Tom Lane wrote:
>> Investigation showed that there are a couple of reasons.  One,
>> isolationtester's is-it-waiting query takes an insane amount of
>> time under CLOBBER_CACHE_ALWAYS --- over half a second on my
>> reasonably new server.
>
> I wonder if we shouldn't just expose a 'which pid is process X waiting
> for' API, implemented serverside. That's generally really useful, and
> looks like it's actually going to be less complicated than that
> query... And it's surely going to be faster.

If PID 12000 and PID 13000 hold AccessShareLock on relation foo, and
PID 14000 awaits AccessExclusiveLock on that relation, what does the
function return when 14000 is passed as an argument?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 12, 2016 at 6:22 PM, Andres Freund <andres@anarazel.de> wrote:
>> I wonder if we shouldn't just expose a 'which pid is process X waiting
>> for' API, implemented serverside. That's generally really useful, and
>> looks like it's actually going to be less complicated than that
>> query... And it's surely going to be faster.

> If PID 12000 and PID 13000 hold AccessShareLock on relation foo, and
> PID 14000 awaits AccessExclusiveLock on that relation, what does the
> function return when 14000 is passed as an argument?

Yeah.  In general, it's not that easy to say that A is waiting
specifically on B --- there may be multiple blockers and/or multiple ways
it could get released.  isolationtester's query is not really correct
IMO.  It's okay as long as nothing else besides autovacuum is taking
locks, but I wouldn't want to try to make it work in a general purpose
environment.
        regards, tom lane



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 12, 2016 at 4:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The fundamental problem I fear is that isolationtester is designed around
>> the assumption that only its own actions (query issuances) change the
>> state of the database.  Trying to use it to test deadlock detection is
>> problematic because deadlock-breaking changes the DB state asynchronously.

> Maybe we should introduce a way to declare whether a step is expected
> to wait or not.  I thought about doing that, and the only reason I
> didn't is because I couldn't figure out a reasonable syntax.  But, in
> many respects, that would actually be better than the current system
> of having isolationtester try to figure it out itself.

Meh.  I'm not sure that would actually help anything.  The problem is not
so much with figuring out whether a step blocks, as knowing when it's
expected to complete.  And for sure I don't want to annotate the spec
files to the point of saying "this action should cause these other steps
to complete".

Actually ... the thing we've been fighting over the past couple days is
having to make step completion orders deterministic when in principle
they aren't and don't need to be.  Maybe the solution is something like
the core tests' variant expected files, ugly though those are.
        regards, tom lane



Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Greg Stark
Date:
The tests worked fine on faster build animals, right? And the clobber
animals are much much slower.... So it seems perfectly sensible that their
deadlock timeout would just have to be much much higher to have the same
behaviour. I see nothing wrong in just setting deadlock timeout to a minute
or more on these machines.

The invariant is just that the deadlock timeout needs enough head room over
the actual time the tester's queries take. If they normally take a 1/10th
of a second then why not just set the timeout to 10x however long they take
on the clobber animals?

--
Greg

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> The tests worked fine on faster build animals, right? And the clobber
> animals are much much slower.... So it seems perfectly sensible that their
> deadlock timeout would just have to be much much higher to have the same
> behaviour. I see nothing wrong in just setting deadlock timeout to a minute
> or more on these machines.

We don't have a way to make the isolation tests change behavior depending
on how the backend is compiled.  So the only actually available fix is to
make that test take "a minute or more" for *everybody*.

Aside from that, it's just disturbing that these tests aren't
deterministic regardless of machine speed.  We don't seem to have a
way around that right now, but I wish we did.
        regards, tom lane



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I wonder if we shouldn't just expose a 'which pid is process X waiting
> for' API, implemented serverside. That's generally really useful, and
> looks like it's actually going to be less complicated than that
> query... And it's surely going to be faster.

Attached is a draft patch for a new function that reports the set of PIDs
directly blocking a given PID (that is, holding or awaiting conflicting
locks on the lockable object it's waiting for).

I replaced isolationtester's pg_locks query with this, and found that
it's about 9X faster in a normal build, and 3X faster with
CLOBBER_CACHE_ALWAYS turned on.  That would give us some nice headroom
for the isolation tests with CLOBBER_CACHE_ALWAYS animals.  (Note that
in view of
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2016-02-14%2007%3A38%3A37
we still need to do *something* about the speed of the new deadlock-hard
test; this patch could avoid the need to dumb down or slow down that test
even further.)

Not to be neglected also is that (I believe) this gives the right answer,
whereas isolationtester's existing query is currently completely broken by
parallel queries, and it doesn't understand non-conflicting lock modes
either.  (It did, at least partially, before commit 38f8bdcac4982215;
I am not sure that taking out the mode checks was a good idea.  But
putting them back would make the query slower yet.)

This is WIP, in part because I've written no user docs for the new
function, and in part because I think people might want to bikeshed the
API.  What is here is:

"pg_blocker_pids(integer) returns integer[]" returns a possibly-empty
array of the PIDs of backend processes that block the backend with
specified PID.  You get an empty array, not an error, if the argument
isn't a valid backend PID or that backend isn't waiting.  In parallel
query situations, the output includes PIDs that are blocking any PID
in the given process's lock group, and what is reported is always
the PID of the lock group leader for whichever process is doing the
blocking.  Also, in parallel query situations, the same PID might appear
multiple times in the output; it didn't seem worth trying to eliminate
duplicates.
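
Usage would look about like this (PIDs illustrative; as noted above, the
result array can contain duplicates in parallel-query cases):

-- returns e.g. {12000,13000}; an empty array if 14000 isn't blocked
SELECT pg_blocker_pids(14000);

-- or one row per distinct blocker
SELECT DISTINCT unnest(pg_blocker_pids(14000)) AS blocker_pid;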

Reasonable API change requests might include returning a rowset rather
than an array and returning more data per lock conflict.  I've not
bothered with either thing here because isolationtester doesn't care
and it would make the query somewhat slower for isolationtester's
usage (though, probably, not enough slower to really matter).

It should also be noted that I've not really tested the parallel
query aspects of this, because I'm not sure how to create a conflicting
lock request in a parallel worker.  However, if I'm not still
misunderstanding the new semantics of the lock data structures, that
aspect of it seems unlikely to be wrong.

Comments?

            regards, tom lane

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 91218d0..97e8962 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
*************** HaveVirtualXIDsDelayingChkpt(VirtualTran
*** 2313,2318 ****
--- 2313,2341 ----
  PGPROC *
  BackendPidGetProc(int pid)
  {
+     PGPROC       *result;
+
+     if (pid == 0)                /* never match dummy PGPROCs */
+         return NULL;
+
+     LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+     result = BackendPidGetProcWithLock(pid);
+
+     LWLockRelease(ProcArrayLock);
+
+     return result;
+ }
+
+ /*
+  * BackendPidGetProcWithLock -- get a backend's PGPROC given its PID
+  *
+  * Same as above, except caller must be holding ProcArrayLock.  The found
+  * entry, if any, can be assumed to be valid as long as the lock remains held.
+  */
+ PGPROC *
+ BackendPidGetProcWithLock(int pid)
+ {
      PGPROC       *result = NULL;
      ProcArrayStruct *arrayP = procArray;
      int            index;
*************** BackendPidGetProc(int pid)
*** 2320,2327 ****
      if (pid == 0)                /* never match dummy PGPROCs */
          return NULL;

-     LWLockAcquire(ProcArrayLock, LW_SHARED);
-
      for (index = 0; index < arrayP->numProcs; index++)
      {
          PGPROC       *proc = &allProcs[arrayP->pgprocnos[index]];
--- 2343,2348 ----
*************** BackendPidGetProc(int pid)
*** 2333,2340 ****
          }
      }

-     LWLockRelease(ProcArrayLock);
-
      return result;
  }

--- 2354,2359 ----
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index fef59a2..fb32769 100644
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
***************
*** 21,27 ****
   *
   *    Interface:
   *
!  *    InitLocks(), GetLocksMethodTable(),
   *    LockAcquire(), LockRelease(), LockReleaseAll(),
   *    LockCheckConflicts(), GrantLock()
   *
--- 21,27 ----
   *
   *    Interface:
   *
!  *    InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
   *    LockAcquire(), LockRelease(), LockReleaseAll(),
   *    LockCheckConflicts(), GrantLock()
   *
***************
*** 41,46 ****
--- 41,47 ----
  #include "pg_trace.h"
  #include "pgstat.h"
  #include "storage/proc.h"
+ #include "storage/procarray.h"
  #include "storage/sinvaladt.h"
  #include "storage/spin.h"
  #include "storage/standby.h"
*************** static void CleanUpLock(LOCK *lock, PROC
*** 356,361 ****
--- 357,364 ----
  static void LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
                       LOCKTAG *locktag, LOCKMODE lockmode,
                       bool decrement_strong_lock_count);
+ static void GetSingleProcBlockerStatusData(PGPROC *blocked_proc,
+                                BlockedProcsData *data);


  /*
*************** GetLocksMethodTable(const LOCK *lock)
*** 462,467 ****
--- 465,482 ----
      return LockMethods[lockmethodid];
  }

+ /*
+  * Fetch the lock method table associated with a given locktag
+  */
+ LockMethod
+ GetLockTagsMethodTable(const LOCKTAG *locktag)
+ {
+     LOCKMETHODID lockmethodid = (LOCKMETHODID) locktag->locktag_lockmethodid;
+
+     Assert(0 < lockmethodid && lockmethodid < lengthof(LockMethods));
+     return LockMethods[lockmethodid];
+ }
+

  /*
   * Compute the hash code associated with a LOCKTAG.
*************** GetLockStatusData(void)
*** 3406,3412 ****
       * impractical (in particular, note MAX_SIMUL_LWLOCKS).  It shouldn't
       * matter too much, because none of these locks can be involved in lock
       * conflicts anyway - anything that might must be present in the main lock
!      * table.
       */
      for (i = 0; i < ProcGlobal->allProcCount; ++i)
      {
--- 3421,3430 ----
       * impractical (in particular, note MAX_SIMUL_LWLOCKS).  It shouldn't
       * matter too much, because none of these locks can be involved in lock
       * conflicts anyway - anything that might must be present in the main lock
!      * table.  (For the same reason, we don't sweat about making leaderPid
!      * completely valid.  We cannot safely dereference another backend's
!      * lockGroupLeader field without holding all lock partition locks, and
!      * it's not worth that.)
       */
      for (i = 0; i < ProcGlobal->allProcCount; ++i)
      {
*************** GetLockStatusData(void)
*** 3439,3444 ****
--- 3457,3463 ----
              instance->backend = proc->backendId;
              instance->lxid = proc->lxid;
              instance->pid = proc->pid;
+             instance->leaderPid = proc->pid;
              instance->fastpath = true;

              el++;
*************** GetLockStatusData(void)
*** 3466,3471 ****
--- 3485,3491 ----
              instance->backend = proc->backendId;
              instance->lxid = proc->lxid;
              instance->pid = proc->pid;
+             instance->leaderPid = proc->pid;
              instance->fastpath = true;

              el++;
*************** GetLockStatusData(void)
*** 3517,3522 ****
--- 3537,3543 ----
          instance->backend = proc->backendId;
          instance->lxid = proc->lxid;
          instance->pid = proc->pid;
+         instance->leaderPid = proclock->groupLeader->pid;
          instance->fastpath = false;

          el++;
*************** GetLockStatusData(void)
*** 3538,3543 ****
--- 3559,3754 ----
  }

  /*
+  * GetBlockerStatusData - Return a summary of the lock manager's state
+  * concerning locks that are blocking the specified PID or any member of
+  * the PID's lock group, for use in a user-level reporting function.
+  *
+  * For each PID within the lock group that is awaiting some heavyweight lock,
+  * the return data includes an array of LockInstanceData objects, which are
+  * the same data structure used by GetLockStatusData; but unlike that function,
+  * this one reports only the PROCLOCKs associated with the lock that that pid
+  * is blocked on.  (Hence, all the locktags should be the same for any one
+  * blocked PID.)  In addition, we return an array of the PIDs of those backends
+  * that are ahead of the blocked PID in the lock's wait queue.  These can be
+  * compared with the pids in the LockInstanceData objects to determine which
+  * waiters are ahead of or behind the blocked PID in the queue.
+  *
+  * If blocked_pid isn't a valid backend PID or nothing in its lock group is
+  * waiting on any heavyweight lock, return empty arrays.
+  *
+  * The design goal is to hold the LWLocks for as short a time as possible;
+  * thus, this function simply makes a copy of the necessary data and releases
+  * the locks, allowing the caller to contemplate and format the data for as
+  * long as it pleases.
+  */
+ BlockedProcsData *
+ GetBlockerStatusData(int blocked_pid)
+ {
+     BlockedProcsData *data;
+     PGPROC       *proc;
+     int            i;
+
+     data = (BlockedProcsData *) palloc(sizeof(BlockedProcsData));
+
+     /*
+      * Guess how much space we'll need, and preallocate.  Most of the time
+      * this will avoid needing to do repalloc while holding the LWLocks.  (We
+      * assume, but check with an Assert, that MaxBackends is enough entries
+      * for the procs[] array; the other two could need enlargement, though.)
+      */
+     data->nprocs = data->nlocks = data->npids = 0;
+     data->maxprocs = data->maxlocks = data->maxpids = MaxBackends;
+     data->procs = (BlockedProcData *) palloc(sizeof(BlockedProcData) * data->maxprocs);
+     data->locks = (LockInstanceData *) palloc(sizeof(LockInstanceData) * data->maxlocks);
+     data->waiter_pids = (int *) palloc(sizeof(int) * data->maxpids);
+
+     /*
+      * In order to search the ProcArray for blocked_pid and assume that that
+      * entry won't immediately disappear under us, we must hold ProcArrayLock.
+      * In addition, to examine the lock grouping fields of any other backend,
+      * we must hold all the hash partition locks.  (Only one of those locks is
+      * actually relevant for any one lock group, but we can't know which one
+      * ahead of time.)    It's fairly annoying to hold all those locks
+      * throughout this, but it's no worse than GetLockStatusData(), and it
+      * does have the advantage that we're guaranteed to return a
+      * self-consistent instantaneous state.
+      */
+     LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+     proc = BackendPidGetProcWithLock(blocked_pid);
+
+     /* Nothing to do if it's gone */
+     if (proc != NULL)
+     {
+         /*
+          * Acquire lock on the entire shared lock data structure.  See notes
+          * in GetLockStatusData().
+          */
+         for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+             LWLockAcquire(LockHashPartitionLockByIndex(i), LW_SHARED);
+
+         if (proc->lockGroupLeader == NULL)
+         {
+             /* Easy case, proc is not a lock group member */
+             GetSingleProcBlockerStatusData(proc, data);
+         }
+         else
+         {
+             /* Examine all procs in proc's lock group */
+             dlist_iter    iter;
+
+             dlist_foreach(iter, &proc->lockGroupLeader->lockGroupMembers)
+             {
+                 PGPROC       *memberProc;
+
+                 memberProc = dlist_container(PGPROC, lockGroupLink, iter.cur);
+                 GetSingleProcBlockerStatusData(memberProc, data);
+             }
+         }
+
+         /*
+          * And release locks.  See notes in GetLockStatusData().
+          */
+         for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
+             LWLockRelease(LockHashPartitionLockByIndex(i));
+
+         Assert(data->nprocs <= data->maxprocs);
+     }
+
+     LWLockRelease(ProcArrayLock);
+
+     return data;
+ }
+
+ /* Accumulate data about one possibly-blocked proc for GetBlockerStatusData */
+ static void
+ GetSingleProcBlockerStatusData(PGPROC *blocked_proc, BlockedProcsData *data)
+ {
+     LOCK       *theLock = blocked_proc->waitLock;
+     BlockedProcData *bproc;
+     SHM_QUEUE  *procLocks;
+     PROCLOCK   *proclock;
+     PROC_QUEUE *waitQueue;
+     PGPROC       *proc;
+     int            queue_size;
+     int            i;
+
+     /* Nothing to do if this proc is not blocked */
+     if (theLock == NULL)
+         return;
+
+     /* Set up a procs[] element */
+     bproc = &data->procs[data->nprocs++];
+     bproc->pid = blocked_proc->pid;
+     bproc->first_lock = data->nlocks;
+     bproc->first_waiter = data->npids;
+
+     /*
+      * We may ignore the proc's fast-path arrays, since nothing in those could
+      * be related to a contended lock.
+      */
+
+     /* Scan all PROCLOCKs associated with theLock. */
+     procLocks = &(theLock->procLocks);
+     proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
+                                          offsetof(PROCLOCK, lockLink));
+     while (proclock)
+     {
+         PGPROC       *proc = proclock->tag.myProc;
+         LOCK       *lock = proclock->tag.myLock;
+         LockInstanceData *instance;
+
+         if (data->nlocks >= data->maxlocks)
+         {
+             data->maxlocks += MaxBackends;
+             data->locks = (LockInstanceData *)
+                 repalloc(data->locks, sizeof(LockInstanceData) * data->maxlocks);
+         }
+
+         instance = &data->locks[data->nlocks];
+         memcpy(&instance->locktag, &lock->tag, sizeof(LOCKTAG));
+         instance->holdMask = proclock->holdMask;
+         if (proc->waitLock == proclock->tag.myLock)
+             instance->waitLockMode = proc->waitLockMode;
+         else
+             instance->waitLockMode = NoLock;
+         instance->backend = proc->backendId;
+         instance->lxid = proc->lxid;
+         instance->pid = proc->pid;
+         instance->leaderPid = proclock->groupLeader->pid;
+         instance->fastpath = false;
+         data->nlocks++;
+
+         proclock = (PROCLOCK *) SHMQueueNext(procLocks, &proclock->lockLink,
+                                              offsetof(PROCLOCK, lockLink));
+     }
+
+     /* Now scan the lock's wait queue, stopping at blocked_proc */
+     waitQueue = &(theLock->waitProcs);
+     queue_size = waitQueue->size;
+
+     if (queue_size > data->maxpids - data->npids)
+     {
+         data->maxpids = Max(data->maxpids + MaxBackends,
+                             data->npids + queue_size);
+         data->waiter_pids = (int *) repalloc(data->waiter_pids,
+                                              sizeof(int) * data->maxpids);
+     }
+
+     proc = (PGPROC *) waitQueue->links.next;
+     for (i = 0; i < queue_size; i++)
+     {
+         if (proc == blocked_proc)
+             break;
+         data->waiter_pids[data->npids++] = proc->pid;
+         proc = (PGPROC *) proc->links.next;
+     }
+
+     bproc->num_locks = data->nlocks - bproc->first_lock;
+     bproc->num_waiters = data->npids - bproc->first_waiter;
+ }
+
+ /*
   * Returns a list of currently held AccessExclusiveLocks, for use by
   * LogStandbySnapshot().  The result is a palloc'd array,
   * with the number of elements returned into *nlocks.
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 73c78e9..ad9072d 100644
*** a/src/backend/utils/adt/lockfuncs.c
--- b/src/backend/utils/adt/lockfuncs.c
***************
*** 18,23 ****
--- 18,24 ----
  #include "funcapi.h"
  #include "miscadmin.h"
  #include "storage/predicate_internals.h"
+ #include "utils/array.h"
  #include "utils/builtins.h"


*************** pg_lock_status(PG_FUNCTION_ARGS)
*** 99,105 ****
          oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

          /* build tupdesc for result tuples */
!         /* this had better match pg_locks view in system_views.sql */
          tupdesc = CreateTemplateTupleDesc(NUM_LOCK_STATUS_COLUMNS, false);
          TupleDescInitEntry(tupdesc, (AttrNumber) 1, "locktype",
                             TEXTOID, -1, 0);
--- 100,106 ----
          oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

          /* build tupdesc for result tuples */
!         /* this had better match function's declaration in pg_proc.h */
          tupdesc = CreateTemplateTupleDesc(NUM_LOCK_STATUS_COLUMNS, false);
          TupleDescInitEntry(tupdesc, (AttrNumber) 1, "locktype",
                             TEXTOID, -1, 0);
*************** pg_lock_status(PG_FUNCTION_ARGS)
*** 395,400 ****
--- 396,511 ----


  /*
+  * pg_blocker_pids - produce an array of the PIDs blocking given PID
+  *
+  * In parallel-query cases, we report all PIDs blocking any member of the
+  * given PID's lock group, and the reported PIDs are always those of lock
+  * group leaders.  We do not bother eliminating duplicates from the result.
+  *
+  * We need not consider predicate locks here, since those don't block anything.
+  */
+ Datum
+ pg_blocker_pids(PG_FUNCTION_ARGS)
+ {
+     int            blocked_pid = PG_GETARG_INT32(0);
+     Datum       *arrayelems;
+     int            narrayelems;
+     BlockedProcsData *lockData; /* state data from lmgr */
+     int            i,
+                 j;
+
+     /* Collect a snapshot of lock manager state */
+     lockData = GetBlockerStatusData(blocked_pid);
+
+     /* We can't need more output entries than there are reported PROCLOCKs */
+     arrayelems = (Datum *) palloc(lockData->nlocks * sizeof(Datum));
+     narrayelems = 0;
+
+     /* For each blocked proc in the lock group ... */
+     for (i = 0; i < lockData->nprocs; i++)
+     {
+         BlockedProcData *bproc = &lockData->procs[i];
+         LockInstanceData *instances = &lockData->locks[bproc->first_lock];
+         int           *preceding_waiters = &lockData->waiter_pids[bproc->first_waiter];
+         LockInstanceData *blocked_instance;
+         LockMethod    lockMethodTable;
+         int            conflictMask;
+
+         /*
+          * Locate the blocked proc's own entry in the LockInstanceData array.
+          * There should be exactly one matching entry.
+          */
+         blocked_instance = NULL;
+         for (j = 0; j < bproc->num_locks; j++)
+         {
+             LockInstanceData *instance = &(instances[j]);
+
+             if (instance->pid == bproc->pid)
+             {
+                 Assert(blocked_instance == NULL);
+                 blocked_instance = instance;
+             }
+         }
+         Assert(blocked_instance != NULL);
+
+         lockMethodTable = GetLockTagsMethodTable(&(blocked_instance->locktag));
+         conflictMask = lockMethodTable->conflictTab[blocked_instance->waitLockMode];
+
+         /* Now scan the PROCLOCK data for conflicting procs */
+         for (j = 0; j < bproc->num_locks; j++)
+         {
+             LockInstanceData *instance = &(instances[j]);
+
+             /* A proc never blocks itself, so ignore that entry */
+             if (instance == blocked_instance)
+                 continue;
+             /* Members of same lock group never block each other, either */
+             if (instance->leaderPid == blocked_instance->leaderPid)
+                 continue;
+
+             if (conflictMask & instance->holdMask)
+                  /* blocked by lock already held by this entry */ ;
+             else if (instance->waitLockMode != NoLock &&
+                      (conflictMask & LOCKBIT_ON(instance->waitLockMode)))
+             {
+                 /* conflict with lock requested by this entry; who's in front? */
+                 bool        ahead = false;
+                 int            k;
+
+                 for (k = 0; k < bproc->num_waiters; k++)
+                 {
+                     if (preceding_waiters[k] == instance->pid)
+                     {
+                         /* this entry is waiting ahead of blocked proc */
+                         ahead = true;
+                         break;
+                     }
+                 }
+                 if (!ahead)
+                     continue;    /* not blocked by this entry */
+             }
+             else
+             {
+                 /* not blocked by this entry */
+                 continue;
+             }
+
+             /* blocked by this entry, so emit a record */
+             arrayelems[narrayelems++] = Int32GetDatum(instance->leaderPid);
+         }
+     }
+
+     /* Assert we didn't overrun arrayelems[] */
+     Assert(narrayelems <= lockData->nlocks);
+
+     /* Construct array, using hardwired knowledge about int4 type */
+     PG_RETURN_ARRAYTYPE_P(construct_array(arrayelems, narrayelems,
+                                           INT4OID,
+                                           sizeof(int32), true, 'i'));
+ }
+
+
+ /*
   * Functions for manipulating advisory locks
   *
   * We make use of the locktag fields as follows:
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 1c0ef9a..90a02b3 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 3329 (  pg_show_all_fi
*** 3012,3017 ****
--- 3012,3019 ----
  DESCR("show config file settings");
  DATA(insert OID = 1371 (  pg_lock_status   PGNSP PGUID 12 1 1000 0 0 f f f f t t v s 0 0 2249 ""
"{25,26,26,23,21,25,28,26,26,21,25,23,25,16,16}""{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}"
"{locktype,database,relation,page,tuple,virtualxid,transactionid,classid,objid,objsubid,virtualtransaction,pid,mode,granted,fastpath}"
_null__null_ pg_lock_status _null_ _null_ _null_ )); 
  DESCR("view system lock information");
+ DATA(insert OID = 2561 (  pg_blocker_pids  PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 1007 "23" _null_ _null_ _null_
_null__null_ pg_blocker_pids _null_ _null_ _null_ )); 
+ DESCR("report PIDs of blockers of specified backend PID, as array");
  DATA(insert OID = 1065 (  pg_prepared_xact PGNSP PGUID 12 1 1000 0 0 f f f f t t v s 0 0 2249 "" "{28,25,1184,26,26}"
"{o,o,o,o,o}""{transaction,gid,prepared,ownerid,dbid}" _null_ _null_ pg_prepared_xact _null_ _null_ _null_ )); 
  DESCR("view two-phase transactions");
  DATA(insert OID = 3819 (  pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v s 1 0 2249 "28"
"{28,28,25}""{i,o,o}" "{multixid,xid,mode}" _null_ _null_ pg_get_multixact_members _null_ _null_ _null_ )); 
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 6b4e365..2ab5aa9 100644
*** a/src/include/storage/lock.h
--- b/src/include/storage/lock.h
*************** typedef struct PROCLOCK
*** 346,352 ****
      PROCLOCKTAG tag;            /* unique identifier of proclock object */

      /* data */
!     PGPROC       *groupLeader;    /* group leader, or NULL if no lock group */
      LOCKMASK    holdMask;        /* bitmask for lock types currently held */
      LOCKMASK    releaseMask;    /* bitmask for lock types to be released */
      SHM_QUEUE    lockLink;        /* list link in LOCK's list of proclocks */
--- 346,352 ----
      PROCLOCKTAG tag;            /* unique identifier of proclock object */

      /* data */
!     PGPROC       *groupLeader;    /* proc's lock group leader, or proc itself */
      LOCKMASK    holdMask;        /* bitmask for lock types currently held */
      LOCKMASK    releaseMask;    /* bitmask for lock types to be released */
      SHM_QUEUE    lockLink;        /* list link in LOCK's list of proclocks */
*************** typedef struct LOCALLOCK
*** 423,443 ****

  typedef struct LockInstanceData
  {
!     LOCKTAG        locktag;        /* locked object */
      LOCKMASK    holdMask;        /* locks held by this PGPROC */
      LOCKMODE    waitLockMode;    /* lock awaited by this PGPROC, if any */
      BackendId    backend;        /* backend ID of this PGPROC */
      LocalTransactionId lxid;    /* local transaction ID of this PGPROC */
      int            pid;            /* pid of this PGPROC */
      bool        fastpath;        /* taken via fastpath? */
  } LockInstanceData;

  typedef struct LockData
  {
      int            nelements;        /* The length of the array */
!     LockInstanceData *locks;
  } LockData;


  /* Result codes for LockAcquire() */
  typedef enum
--- 423,470 ----

  typedef struct LockInstanceData
  {
!     LOCKTAG        locktag;        /* tag for locked object */
      LOCKMASK    holdMask;        /* locks held by this PGPROC */
      LOCKMODE    waitLockMode;    /* lock awaited by this PGPROC, if any */
      BackendId    backend;        /* backend ID of this PGPROC */
      LocalTransactionId lxid;    /* local transaction ID of this PGPROC */
      int            pid;            /* pid of this PGPROC */
+     int            leaderPid;        /* pid of group leader; = pid if no group */
      bool        fastpath;        /* taken via fastpath? */
  } LockInstanceData;

  typedef struct LockData
  {
      int            nelements;        /* The length of the array */
!     LockInstanceData *locks;    /* Array of per-PROCLOCK information */
  } LockData;

+ typedef struct BlockedProcData
+ {
+     int            pid;            /* pid of a blocked PGPROC */
+     /* Per-PROCLOCK information about PROCLOCKs of the lock the pid awaits */
+     /* (these fields refer to indexes in BlockedProcsData.locks[]) */
+     int            first_lock;        /* index of first relevant LockInstanceData */
+     int            num_locks;        /* number of relevant LockInstanceDatas */
+     /* PIDs of PGPROCs that are ahead of "pid" in the lock's wait queue */
+     /* (these fields refer to indexes in BlockedProcsData.waiter_pids[]) */
+     int            first_waiter;    /* index of first preceding waiter */
+     int            num_waiters;    /* number of preceding waiters */
+ } BlockedProcData;
+
+ typedef struct BlockedProcsData
+ {
+     BlockedProcData *procs;        /* Array of per-blocked-proc information */
+     LockInstanceData *locks;    /* Array of per-PROCLOCK information */
+     int           *waiter_pids;    /* Array of PIDs of other blocked PGPROCs */
+     int            nprocs;            /* # of valid entries in procs[] array */
+     int            maxprocs;        /* Allocated length of procs[] array */
+     int            nlocks;            /* # of valid entries in locks[] array */
+     int            maxlocks;        /* Allocated length of locks[] array */
+     int            npids;            /* # of valid entries in waiter_pids[] array */
+     int            maxpids;        /* Allocated length of waiter_pids[] array */
+ } BlockedProcsData;
+

  /* Result codes for LockAcquire() */
  typedef enum
*************** typedef enum
*** 488,493 ****
--- 515,521 ----
   */
  extern void InitLocks(void);
  extern LockMethod GetLocksMethodTable(const LOCK *lock);
+ extern LockMethod GetLockTagsMethodTable(const LOCKTAG *locktag);
  extern uint32 LockTagHashCode(const LOCKTAG *locktag);
  extern bool DoLockModesConflict(LOCKMODE mode1, LOCKMODE mode2);
  extern LockAcquireResult LockAcquire(const LOCKTAG *locktag,
*************** extern void GrantAwaitedLock(void);
*** 520,525 ****
--- 548,554 ----
  extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode);
  extern Size LockShmemSize(void);
  extern LockData *GetLockStatusData(void);
+ extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);

  extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks);
  extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode);
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 1fbf4f3..dd37c0c 100644
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
*************** extern VirtualTransactionId *GetVirtualX
*** 61,66 ****
--- 61,67 ----
  extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);

  extern PGPROC *BackendPidGetProc(int pid);
+ extern PGPROC *BackendPidGetProcWithLock(int pid);
  extern int    BackendXidGetPid(TransactionId xid);
  extern bool IsBackendPid(int pid);

diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index affcc01..4848cbd 100644
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
*************** extern Datum row_security_active_name(PG
*** 1153,1158 ****
--- 1153,1159 ----

  /* lockfuncs.c */
  extern Datum pg_lock_status(PG_FUNCTION_ARGS);
+ extern Datum pg_blocker_pids(PG_FUNCTION_ARGS);
  extern Datum pg_advisory_lock_int8(PG_FUNCTION_ARGS);
  extern Datum pg_advisory_xact_lock_int8(PG_FUNCTION_ARGS);
  extern Datum pg_advisory_lock_shared_int8(PG_FUNCTION_ARGS);
diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index 0a9d25c..a85d24e 100644
*** a/src/test/isolation/isolationtester.c
--- b/src/test/isolation/isolationtester.c
*************** main(int argc, char **argv)
*** 227,253 ****
       */
      initPQExpBuffer(&wait_query);
      appendPQExpBufferStr(&wait_query,
!                          "SELECT 1 FROM pg_locks holder, pg_locks waiter "
!                          "WHERE NOT waiter.granted AND waiter.pid = $1 "
!                          "AND holder.granted "
!                          "AND holder.pid <> $1 AND holder.pid IN (");
      /* The spec syntax requires at least one session; assume that here. */
      appendPQExpBufferStr(&wait_query, backend_pids[1]);
      for (i = 2; i < nconns; i++)
!         appendPQExpBuffer(&wait_query, ", %s", backend_pids[i]);
!     appendPQExpBufferStr(&wait_query,
!                          ") "
!
!                   "AND holder.locktype IS NOT DISTINCT FROM waiter.locktype "
!                   "AND holder.database IS NOT DISTINCT FROM waiter.database "
!                   "AND holder.relation IS NOT DISTINCT FROM waiter.relation "
!                          "AND holder.page IS NOT DISTINCT FROM waiter.page "
!                          "AND holder.tuple IS NOT DISTINCT FROM waiter.tuple "
!               "AND holder.virtualxid IS NOT DISTINCT FROM waiter.virtualxid "
!         "AND holder.transactionid IS NOT DISTINCT FROM waiter.transactionid "
!                     "AND holder.classid IS NOT DISTINCT FROM waiter.classid "
!                          "AND holder.objid IS NOT DISTINCT FROM waiter.objid "
!                 "AND holder.objsubid IS NOT DISTINCT FROM waiter.objsubid ");

      res = PQprepare(conns[0], PREP_WAITING, wait_query.data, 0, NULL);
      if (PQresultStatus(res) != PGRES_COMMAND_OK)
--- 227,238 ----
       */
      initPQExpBuffer(&wait_query);
      appendPQExpBufferStr(&wait_query,
!                          "SELECT pg_catalog.pg_blocker_pids($1) && '{");
      /* The spec syntax requires at least one session; assume that here. */
      appendPQExpBufferStr(&wait_query, backend_pids[1]);
      for (i = 2; i < nconns; i++)
!         appendPQExpBuffer(&wait_query, ",%s", backend_pids[i]);
!     appendPQExpBufferStr(&wait_query, "}'::integer[]");

      res = PQprepare(conns[0], PREP_WAITING, wait_query.data, 0, NULL);
      if (PQresultStatus(res) != PGRES_COMMAND_OK)
*************** try_complete_step(Step *step, int flags)
*** 745,765 ****
              /* If it's OK for the step to block, check whether it has. */
              if (flags & STEP_NONBLOCK)
              {
!                 int            ntuples;

                  res = PQexecPrepared(conns[0], PREP_WAITING, 1,
                                       &backend_pids[step->session + 1],
                                       NULL, NULL, 0);
!                 if (PQresultStatus(res) != PGRES_TUPLES_OK)
                  {
                      fprintf(stderr, "lock wait query failed: %s",
                              PQerrorMessage(conn));
                      exit_nicely();
                  }
!                 ntuples = PQntuples(res);
                  PQclear(res);

!                 if (ntuples >= 1)        /* waiting to acquire a lock */
                  {
                      if (!(flags & STEP_RETRY))
                          printf("step %s: %s <waiting ...>\n",
--- 730,751 ----
              /* If it's OK for the step to block, check whether it has. */
              if (flags & STEP_NONBLOCK)
              {
!                 bool        waiting;

                  res = PQexecPrepared(conns[0], PREP_WAITING, 1,
                                       &backend_pids[step->session + 1],
                                       NULL, NULL, 0);
!                 if (PQresultStatus(res) != PGRES_TUPLES_OK ||
!                     PQntuples(res) != 1)
                  {
                      fprintf(stderr, "lock wait query failed: %s",
                              PQerrorMessage(conn));
                      exit_nicely();
                  }
!                 waiting = ((PQgetvalue(res, 0, 0))[0] == 't');
                  PQclear(res);

!                 if (waiting)    /* waiting to acquire a lock */
                  {
                      if (!(flags & STEP_RETRY))
                          printf("step %s: %s <waiting ...>\n",

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Robert Haas
Date:
On Tue, Feb 16, 2016 at 2:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> I wonder if we shouldn't just expose a 'which pid is process X waiting
>> for' API, implemented serverside. That's generally really useful, and
>> looks like it's actually going to be less complicated than that
>> query... And it's surely going to be faster.
>
> Attached is a draft patch for a new function that reports the set of PIDs
> directly blocking a given PID (that is, holding or awaiting conflicting
> locks on the lockable object it's waiting for).
>
> I replaced isolationtester's pg_locks query with this, and found that
> it's about 9X faster in a normal build, and 3X faster with
> CLOBBER_CACHE_ALWAYS turned on.  That would give us some nice headroom
> for the isolation tests with CLOBBER_CACHE_ALWAYS animals.  (Note that
> in view of
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2016-02-14%2007%3A38%3A37
> we still need to do *something* about the speed of the new deadlock-hard
> test; this patch could avoid the need to dumb down or slow down that test
> even further.)
>
> Not to be neglected also is that (I believe) this gives the right answer,
> whereas isolationtester's existing query is currently completely broken by
> parallel queries, and it doesn't understand non-conflicting lock modes
> either.  (It did, at least partially, before commit 38f8bdcac4982215;
> I am not sure that taking out the mode checks was a good idea.  But
> putting them back would make the query slower yet.)

The reason I took that out is because it breaks the deadlock-soft
test.  It's possible to have a situation where no granted lock
conflicts with an awaited lock.  If that happens, the old query
wrongly concluded that the waiting process was not in fact waiting.
(Consider: A holds AccessShareLock, B awaits AccessExclusiveLock, and C now
requests AccessShareLock and *waits*.)
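
To make that concrete, here is a minimal three-session sketch in SQL
(the table name "t" is illustrative; any ordinary table will do):

-- Session A:
BEGIN;
SELECT * FROM t;    -- acquires AccessShareLock on t and keeps it

-- Session B:
BEGIN;
LOCK TABLE t IN ACCESS EXCLUSIVE MODE;  -- conflicts with A's granted lock; B waits

-- Session C:
BEGIN;
SELECT * FROM t;    -- conflicts with no *granted* lock, but does conflict
                    -- with the AccessExclusiveLock B is waiting for, and B
                    -- is ahead of C in the wait queue, so C waits too

A query that only compares waiters against granted locks concludes,
wrongly, that session C is not waiting.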

As for the patch itself, I'm having trouble grokking what it's trying
to do.  I think it might be worth having a comment defining precisely
what we mean by "A blocks B".  I would define "A blocks B" in general
as either A holds a lock which conflicts with one sought by B
(hard-blocked) or A awaits a lock which conflicts with one sought by B
and precedes it in the wait queue (soft-blocked).  I have wondered
before if we shouldn't modify pg_locks to expose the wait-queue
ordering; without that, you can't reliably determine in general
whether A soft-blocks B, which means every view anyone has ever
written over pg_locks that purports to say who blocks whom is
necessarily buggy.

For parallel queries, there's a further relevant distinction when we
say "A blocks B".  We might mean either that (1) process B cannot
resume execution until the lock conflict is resolved or (2) the group
leader for process B cannot complete the current parallel operation
until the lock conflict is resolved.  If you're trying to figure out
why one particular member of a parallel group is stuck, you want to
answer question #1.  If you're trying to figure out all the things
that need to get out of the way for the query to finish, you want
to answer question #2.  I think this function is aiming to answer
question #2, not question #1, but I'm less clear on the reason behind
that definitional choice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Feb 16, 2016 at 2:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Not to be neglected also is that (I believe) this gives the right answer,
>> whereas isolationtester's existing query is currently completely broken by
>> parallel queries, and it doesn't understand non-conflicting lock modes
>> either.  (It did, at least partially, before commit 38f8bdcac4982215;
>> I am not sure that taking out the mode checks was a good idea.  But
>> putting them back would make the query slower yet.)

> The reason I took that out is because it breaks the deadlock-soft
> test.  It's possible to have a situation where no granted lock
> conflicts with an awaited lock.  If that happens, the old query
> wrongly concluded that the waiting process was not in fact waiting.
> (Consider: A holds AccessShareLock, B awaits AccessExclusiveLock, and C now
> requests AccessShareLock and *waits*.)

Ah, well, with this patch the deadlock-soft test still passes.

> As for the patch itself, I'm having trouble grokking what it's trying
> to do.  I think it might be worth having a comment defining precisely
> what we mean by "A blocks B".  I would define "A blocks B" in general
> as either A holds a lock which conflicts with one sought by B
> (hard-blocked) or A awaits a lock which conflicts with one sought by B
> and precedes it in the wait queue (soft-blocked).

Yes, that is exactly what I implemented ... and it's something you can't
find out from pg_locks.  I'm not sure how that view could be made to
expose wait-queue ordering.

> For parallel queries, there's a further relevant distinction when we
> say "A blocks B".  We might mean either that (1) process B cannot
> resume execution until the lock conflict is resolved or (2) the group
> leader for process B cannot complete the current parallel operation
> until the lock conflict is resolved.

The definition I used in this patch is "some member of A's lock group
blocks some member of B's lock group", because that corresponds directly
to whether A is preventing B's query from completing, which is what
isolationtester wants to know --- and, I would argue, it's generally what
any client would want to know.  99.9% of clients would just as soon not be
aware of parallel workers lurking underneath the pg_backend_pid() values
that they see for their sessions.
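
For illustration, a client could use it like this (the PID is a
hypothetical pg_backend_pid() value reported by some other session):

SELECT pg_blocking_pids(12345);
-- e.g. {23456}; an empty array if 12345 is unknown or not blocked

SELECT pid, state, query
  FROM pg_stat_activity
 WHERE pid = ANY (pg_blocking_pids(12345));
-- shows what each blocking session is currently doing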
        regards, tom lane



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From
Tom Lane
Date:
I just had a rather disturbing thought to the effect that this entire
design --- ie, parallel workers taking out locks for themselves --- is
fundamentally flawed.  As far as I can tell from README.parallel,
parallel workers are supposed to exit (and, presumably, release their
locks) before the leader's transaction commits.  Releasing locks before
commit is wrong.  Do I need to rehearse why?
        regards, tom lane



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From
Tom Lane
Date:
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> As for the patch itself, I'm having trouble grokking what it's trying
>> to do.  I think it might be worth having a comment defining precisely
>> what we mean by "A blocks B".  I would define "A blocks B" in general
>> as either A holds a lock which conflicts with one sought by B
>> (hard-blocked) or A awaits a lock which conflicts with one sought by B
>> and precedes it in the wait queue (soft-blocked).

> Yes, that is exactly what I implemented ... and it's something you can't
> find out from pg_locks.  I'm not sure how that view could be made to
> expose wait-queue ordering.

Here's an updated version of this patch, now with user-facing docs.

I decided that "pg_blocking_pids()" is a better function name than
"pg_blocker_pids()".  The code's otherwise the same, although I
revisited some of the comments.

I also changed quite a few references to "transaction" into "process"
in the discussion of pg_locks.  The previous choice to conflate
processes with transactions was never terribly wise in my view, and
it's certainly completely broken by parallel query.

            regards, tom lane

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index d77e999..d3270e4 100644
*** a/doc/src/sgml/catalogs.sgml
--- b/doc/src/sgml/catalogs.sgml
***************
*** 8015,8030 ****

    <para>
     The view <structname>pg_locks</structname> provides access to
!    information about the locks held by open transactions within the
     database server.  See <xref linkend="mvcc"> for more discussion
     of locking.
    </para>

    <para>
     <structname>pg_locks</structname> contains one row per active lockable
!    object, requested lock mode, and relevant transaction.  Thus, the same
     lockable object might
!    appear many times, if multiple transactions are holding or waiting
     for locks on it.  However, an object that currently has no locks on it
     will not appear at all.
    </para>
--- 8015,8030 ----

    <para>
     The view <structname>pg_locks</structname> provides access to
!    information about the locks held by active processes within the
     database server.  See <xref linkend="mvcc"> for more discussion
     of locking.
    </para>

    <para>
     <structname>pg_locks</structname> contains one row per active lockable
!    object, requested lock mode, and relevant process.  Thus, the same
     lockable object might
!    appear many times, if multiple processes are holding or waiting
     for locks on it.  However, an object that currently has no locks on it
     will not appear at all.
    </para>
***************
*** 8200,8210 ****

    <para>
     <structfield>granted</structfield> is true in a row representing a lock
!    held by the indicated transaction.  False indicates that this transaction is
!    currently waiting to acquire this lock, which implies that some other
!    transaction is holding a conflicting lock mode on the same lockable object.
!    The waiting transaction will sleep until the other lock is released (or a
!    deadlock situation is detected). A single transaction can be waiting to
     acquire at most one lock at a time.
    </para>

--- 8200,8210 ----

    <para>
     <structfield>granted</structfield> is true in a row representing a lock
!    held by the indicated process.  False indicates that this process is
!    currently waiting to acquire this lock, which implies that at least one other
!    process is holding a conflicting lock mode on the same lockable object.
!    The waiting process will sleep until the other lock is released (or a
!    deadlock situation is detected). A single process can be waiting to
     acquire at most one lock at a time.
    </para>

***************
*** 8224,8230 ****
     Although tuples are a lockable type of object,
     information about row-level locks is stored on disk, not in memory,
     and therefore row-level locks normally do not appear in this view.
!    If a transaction is waiting for a
     row-level lock, it will usually appear in the view as waiting for the
     permanent transaction ID of the current holder of that row lock.
    </para>
--- 8224,8230 ----
     Although tuples are a lockable type of object,
     information about row-level locks is stored on disk, not in memory,
     and therefore row-level locks normally do not appear in this view.
!    If a process is waiting for a
     row-level lock, it will usually appear in the view as waiting for the
     permanent transaction ID of the current holder of that row lock.
    </para>
*************** SELECT * FROM pg_locks pl LEFT JOIN pg_p
*** 8281,8286 ****
--- 8281,8300 ----
    </para>

    <para>
+    While it is possible to obtain information about which processes block
+    which other processes by joining <structname>pg_locks</structname> against
+    itself, this is very difficult to get right in detail.  Such a query would
+    have to encode knowledge about which lock modes conflict with which
+    others.  Worse, the <structname>pg_locks</structname> view does not expose
+    information about which processes are ahead of which others in lock wait
+    queues, nor information about which processes are parallel workers running
+    on behalf of which other client sessions.  It is better to use
+    the <function>pg_blocking_pids()</> function
+    (see <xref linkend="functions-info-session-table">) to identify which
+    process(es) a waiting process is blocked behind.
+   </para>
+
+   <para>
     The <structname>pg_locks</structname> view displays data from both the
     regular lock manager and the predicate lock manager, which are
     separate systems; in addition, the regular lock manager subdivides its
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b001ce5..2d9fe7f 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*************** SELECT * FROM pg_ls_dir('.') WITH ORDINA
*** 14997,15002 ****
--- 14997,15008 ----
        </row>

        <row>
+        <entry><literal><function>pg_blocking_pids(<type>int</type>)</function></literal></entry>
+        <entry><type>int[]</type></entry>
+        <entry>Process ID(s) that are blocking specified server process ID</entry>
+       </row>
+
+       <row>
         <entry><literal><function>pg_conf_load_time()</function></literal></entry>
         <entry><type>timestamp with time zone</type></entry>
         <entry>configuration load time</entry>
*************** SET search_path TO <replaceable>schema</
*** 15184,15189 ****
--- 15190,15215 ----
     </para>

     <indexterm>
+     <primary>pg_blocking_pids</primary>
+    </indexterm>
+
+    <para>
+     <function>pg_blocking_pids</function> returns an array of the process IDs
+     of the sessions that are blocking the server process with the specified
+     process ID, or an empty array if there is no such server process or it is
+     not blocked.  One server process blocks another if it either holds a lock
+     that conflicts with the blocked process's lock request (hard block), or is
+     waiting for a lock that would conflict with the blocked process's lock
+     request and is ahead of it in the wait queue (soft block).  When using
+     parallel queries the result always lists client-visible process IDs (that
+     is, <function>pg_backend_pid</> results) even if the actual lock is held
+     or awaited by a child worker process.  As a result of that, there may be
+     duplicated PIDs in the result.  Also note that when a prepared transaction
+     holds a conflicting lock, it will be represented by a zero process ID in
+     the result of this function.
+    </para>
+
+    <indexterm>
      <primary>pg_conf_load_time</primary>
     </indexterm>

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 91218d0..97e8962 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
*************** HaveVirtualXIDsDelayingChkpt(VirtualTran
*** 2313,2318 ****
--- 2313,2341 ----
  PGPROC *
  BackendPidGetProc(int pid)
  {
+     PGPROC       *result;
+
+     if (pid == 0)                /* never match dummy PGPROCs */
+         return NULL;
+
+     LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+     result = BackendPidGetProcWithLock(pid);
+
+     LWLockRelease(ProcArrayLock);
+
+     return result;
+ }
+
+ /*
+  * BackendPidGetProcWithLock -- get a backend's PGPROC given its PID
+  *
+  * Same as above, except caller must be holding ProcArrayLock.  The found
+  * entry, if any, can be assumed to be valid as long as the lock remains held.
+  */
+ PGPROC *
+ BackendPidGetProcWithLock(int pid)
+ {
      PGPROC       *result = NULL;
      ProcArrayStruct *arrayP = procArray;
      int            index;
*************** BackendPidGetProc(int pid)
*** 2320,2327 ****
      if (pid == 0)                /* never match dummy PGPROCs */
          return NULL;

-     LWLockAcquire(ProcArrayLock, LW_SHARED);
-
      for (index = 0; index < arrayP->numProcs; index++)
      {
          PGPROC       *proc = &allProcs[arrayP->pgprocnos[index]];
--- 2343,2348 ----
*************** BackendPidGetProc(int pid)
*** 2333,2340 ****
          }
      }

-     LWLockRelease(ProcArrayLock);
-
      return result;
  }

--- 2354,2359 ----
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index fef59a2..9eb4dfb 100644
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
***************
*** 21,27 ****
   *
   *    Interface:
   *
!  *    InitLocks(), GetLocksMethodTable(),
   *    LockAcquire(), LockRelease(), LockReleaseAll(),
   *    LockCheckConflicts(), GrantLock()
   *
--- 21,27 ----
   *
   *    Interface:
   *
!  *    InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
   *    LockAcquire(), LockRelease(), LockReleaseAll(),
   *    LockCheckConflicts(), GrantLock()
   *
***************
*** 41,46 ****
--- 41,47 ----
  #include "pg_trace.h"
  #include "pgstat.h"
  #include "storage/proc.h"
+ #include "storage/procarray.h"
  #include "storage/sinvaladt.h"
  #include "storage/spin.h"
  #include "storage/standby.h"
*************** static void CleanUpLock(LOCK *lock, PROC
*** 356,361 ****
--- 357,364 ----
  static void LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
                       LOCKTAG *locktag, LOCKMODE lockmode,
                       bool decrement_strong_lock_count);
+ static void GetSingleProcBlockerStatusData(PGPROC *blocked_proc,
+                                BlockedProcsData *data);


  /*
*************** GetLocksMethodTable(const LOCK *lock)
*** 462,467 ****
--- 465,482 ----
      return LockMethods[lockmethodid];
  }

+ /*
+  * Fetch the lock method table associated with a given locktag
+  */
+ LockMethod
+ GetLockTagsMethodTable(const LOCKTAG *locktag)
+ {
+     LOCKMETHODID lockmethodid = (LOCKMETHODID) locktag->locktag_lockmethodid;
+
+     Assert(0 < lockmethodid && lockmethodid < lengthof(LockMethods));
+     return LockMethods[lockmethodid];
+ }
+

  /*
   * Compute the hash code associated with a LOCKTAG.
*************** GetLockStatusData(void)
*** 3406,3412 ****
       * impractical (in particular, note MAX_SIMUL_LWLOCKS).  It shouldn't
       * matter too much, because none of these locks can be involved in lock
       * conflicts anyway - anything that might must be present in the main lock
!      * table.
       */
      for (i = 0; i < ProcGlobal->allProcCount; ++i)
      {
--- 3421,3430 ----
       * impractical (in particular, note MAX_SIMUL_LWLOCKS).  It shouldn't
       * matter too much, because none of these locks can be involved in lock
       * conflicts anyway - anything that might must be present in the main lock
!      * table.  (For the same reason, we don't sweat about making leaderPid
!      * completely valid.  We cannot safely dereference another backend's
!      * lockGroupLeader field without holding all lock partition locks, and
!      * it's not worth that.)
       */
      for (i = 0; i < ProcGlobal->allProcCount; ++i)
      {
*************** GetLockStatusData(void)
*** 3439,3444 ****
--- 3457,3463 ----
              instance->backend = proc->backendId;
              instance->lxid = proc->lxid;
              instance->pid = proc->pid;
+             instance->leaderPid = proc->pid;
              instance->fastpath = true;

              el++;
*************** GetLockStatusData(void)
*** 3466,3471 ****
--- 3485,3491 ----
              instance->backend = proc->backendId;
              instance->lxid = proc->lxid;
              instance->pid = proc->pid;
+             instance->leaderPid = proc->pid;
              instance->fastpath = true;

              el++;
*************** GetLockStatusData(void)
*** 3517,3522 ****
--- 3537,3543 ----
          instance->backend = proc->backendId;
          instance->lxid = proc->lxid;
          instance->pid = proc->pid;
+         instance->leaderPid = proclock->groupLeader->pid;
          instance->fastpath = false;

          el++;
*************** GetLockStatusData(void)
*** 3538,3543 ****
--- 3559,3755 ----
  }

  /*
+  * GetBlockerStatusData - Return a summary of the lock manager's state
+  * concerning locks that are blocking the specified PID or any member of
+  * the PID's lock group, for use in a user-level reporting function.
+  *
+  * For each PID within the lock group that is awaiting some heavyweight lock,
+  * the return data includes an array of LockInstanceData objects, which are
+  * the same data structure used by GetLockStatusData; but unlike that function,
+  * this one reports only the PROCLOCKs associated with the lock that that pid
+  * is blocked on.  (Hence, all the locktags should be the same for any one
+  * blocked PID.)  In addition, we return an array of the PIDs of those backends
+  * that are ahead of the blocked PID in the lock's wait queue.  These can be
+  * compared with the pids in the LockInstanceData objects to determine which
+  * waiters are ahead of or behind the blocked PID in the queue.
+  *
+  * If blocked_pid isn't a valid backend PID or nothing in its lock group is
+  * waiting on any heavyweight lock, return empty arrays.
+  *
+  * The design goal is to hold the LWLocks for as short a time as possible;
+  * thus, this function simply makes a copy of the necessary data and releases
+  * the locks, allowing the caller to contemplate and format the data for as
+  * long as it pleases.
+  */
+ BlockedProcsData *
+ GetBlockerStatusData(int blocked_pid)
+ {
+     BlockedProcsData *data;
+     PGPROC       *proc;
+     int            i;
+
+     data = (BlockedProcsData *) palloc(sizeof(BlockedProcsData));
+
+     /*
+      * Guess how much space we'll need, and preallocate.  Most of the time
+      * this will avoid needing to do repalloc while holding the LWLocks.  (We
+      * assume, but check with an Assert, that MaxBackends is enough entries
+      * for the procs[] array; the other two could need enlargement, though.)
+      */
+     data->nprocs = data->nlocks = data->npids = 0;
+     data->maxprocs = data->maxlocks = data->maxpids = MaxBackends;
+     data->procs = (BlockedProcData *) palloc(sizeof(BlockedProcData) * data->maxprocs);
+     data->locks = (LockInstanceData *) palloc(sizeof(LockInstanceData) * data->maxlocks);
+     data->waiter_pids = (int *) palloc(sizeof(int) * data->maxpids);
+
+     /*
+      * In order to search the ProcArray for blocked_pid and assume that that
+      * entry won't immediately disappear under us, we must hold ProcArrayLock.
+      * In addition, to examine the lock grouping fields of any other backend,
+      * we must hold all the hash partition locks.  (Only one of those locks is
+      * actually relevant for any one lock group, but we can't know which one
+      * ahead of time.)    It's fairly annoying to hold all those locks
+      * throughout this, but it's no worse than GetLockStatusData(), and it
+      * does have the advantage that we're guaranteed to return a
+      * self-consistent instantaneous state.
+      */
+     LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+     proc = BackendPidGetProcWithLock(blocked_pid);
+
+     /* Nothing to do if it's gone */
+     if (proc != NULL)
+     {
+         /*
+          * Acquire lock on the entire shared lock data structure.  See notes
+          * in GetLockStatusData().
+          */
+         for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+             LWLockAcquire(LockHashPartitionLockByIndex(i), LW_SHARED);
+
+         if (proc->lockGroupLeader == NULL)
+         {
+             /* Easy case, proc is not a lock group member */
+             GetSingleProcBlockerStatusData(proc, data);
+         }
+         else
+         {
+             /* Examine all procs in proc's lock group */
+             dlist_iter    iter;
+
+             dlist_foreach(iter, &proc->lockGroupLeader->lockGroupMembers)
+             {
+                 PGPROC       *memberProc;
+
+                 memberProc = dlist_container(PGPROC, lockGroupLink, iter.cur);
+                 GetSingleProcBlockerStatusData(memberProc, data);
+             }
+         }
+
+         /*
+          * And release locks.  See notes in GetLockStatusData().
+          */
+         for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
+             LWLockRelease(LockHashPartitionLockByIndex(i));
+
+         Assert(data->nprocs <= data->maxprocs);
+     }
+
+     LWLockRelease(ProcArrayLock);
+
+     return data;
+ }
+
+ /* Accumulate data about one possibly-blocked proc for GetBlockerStatusData */
+ static void
+ GetSingleProcBlockerStatusData(PGPROC *blocked_proc, BlockedProcsData *data)
+ {
+     LOCK       *theLock = blocked_proc->waitLock;
+     BlockedProcData *bproc;
+     SHM_QUEUE  *procLocks;
+     PROCLOCK   *proclock;
+     PROC_QUEUE *waitQueue;
+     PGPROC       *proc;
+     int            queue_size;
+     int            i;
+
+     /* Nothing to do if this proc is not blocked */
+     if (theLock == NULL)
+         return;
+
+     /* Set up a procs[] element */
+     bproc = &data->procs[data->nprocs++];
+     bproc->pid = blocked_proc->pid;
+     bproc->first_lock = data->nlocks;
+     bproc->first_waiter = data->npids;
+
+     /*
+      * We may ignore the proc's fast-path arrays, since nothing in those could
+      * be related to a contended lock.
+      */
+
+     /* Collect all PROCLOCKs associated with theLock */
+     procLocks = &(theLock->procLocks);
+     proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks,
+                                          offsetof(PROCLOCK, lockLink));
+     while (proclock)
+     {
+         PGPROC       *proc = proclock->tag.myProc;
+         LOCK       *lock = proclock->tag.myLock;
+         LockInstanceData *instance;
+
+         if (data->nlocks >= data->maxlocks)
+         {
+             data->maxlocks += MaxBackends;
+             data->locks = (LockInstanceData *)
+                 repalloc(data->locks, sizeof(LockInstanceData) * data->maxlocks);
+         }
+
+         instance = &data->locks[data->nlocks];
+         memcpy(&instance->locktag, &lock->tag, sizeof(LOCKTAG));
+         instance->holdMask = proclock->holdMask;
+         if (proc->waitLock == lock)
+             instance->waitLockMode = proc->waitLockMode;
+         else
+             instance->waitLockMode = NoLock;
+         instance->backend = proc->backendId;
+         instance->lxid = proc->lxid;
+         instance->pid = proc->pid;
+         instance->leaderPid = proclock->groupLeader->pid;
+         instance->fastpath = false;
+         data->nlocks++;
+
+         proclock = (PROCLOCK *) SHMQueueNext(procLocks, &proclock->lockLink,
+                                              offsetof(PROCLOCK, lockLink));
+     }
+
+     /* Enlarge waiter_pids[] if it's too small to hold all wait queue PIDs */
+     waitQueue = &(theLock->waitProcs);
+     queue_size = waitQueue->size;
+
+     if (queue_size > data->maxpids - data->npids)
+     {
+         data->maxpids = Max(data->maxpids + MaxBackends,
+                             data->npids + queue_size);
+         data->waiter_pids = (int *) repalloc(data->waiter_pids,
+                                              sizeof(int) * data->maxpids);
+     }
+
+     /* Collect PIDs from the lock's wait queue, stopping at blocked_proc */
+     proc = (PGPROC *) waitQueue->links.next;
+     for (i = 0; i < queue_size; i++)
+     {
+         if (proc == blocked_proc)
+             break;
+         data->waiter_pids[data->npids++] = proc->pid;
+         proc = (PGPROC *) proc->links.next;
+     }
+
+     bproc->num_locks = data->nlocks - bproc->first_lock;
+     bproc->num_waiters = data->npids - bproc->first_waiter;
+ }
+
+ /*
   * Returns a list of currently held AccessExclusiveLocks, for use by
   * LogStandbySnapshot().  The result is a palloc'd array,
   * with the number of elements returned into *nlocks.
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 73c78e9..6bcab81 100644
*** a/src/backend/utils/adt/lockfuncs.c
--- b/src/backend/utils/adt/lockfuncs.c
***************
*** 18,23 ****
--- 18,24 ----
  #include "funcapi.h"
  #include "miscadmin.h"
  #include "storage/predicate_internals.h"
+ #include "utils/array.h"
  #include "utils/builtins.h"


*************** pg_lock_status(PG_FUNCTION_ARGS)
*** 99,105 ****
          oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

          /* build tupdesc for result tuples */
!         /* this had better match pg_locks view in system_views.sql */
          tupdesc = CreateTemplateTupleDesc(NUM_LOCK_STATUS_COLUMNS, false);
          TupleDescInitEntry(tupdesc, (AttrNumber) 1, "locktype",
                             TEXTOID, -1, 0);
--- 100,106 ----
          oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

          /* build tupdesc for result tuples */
!         /* this had better match function's declaration in pg_proc.h */
          tupdesc = CreateTemplateTupleDesc(NUM_LOCK_STATUS_COLUMNS, false);
          TupleDescInitEntry(tupdesc, (AttrNumber) 1, "locktype",
                             TEXTOID, -1, 0);
*************** pg_lock_status(PG_FUNCTION_ARGS)
*** 395,400 ****
--- 396,523 ----


  /*
+  * pg_blocking_pids - produce an array of the PIDs blocking given PID
+  *
+  * The reported PIDs are those that hold a lock conflicting with blocked_pid's
+  * current request (hard block), or are requesting such a lock and are ahead
+  * of blocked_pid in the lock's wait queue (soft block).
+  *
+  * In parallel-query cases, we report all PIDs blocking any member of the
+  * given PID's lock group, and the reported PIDs are those of the blocking
+  * PIDs' lock group leaders.  This allows callers to compare the result to
+  * lists of clients' pg_backend_pid() results even during a parallel query.
+  *
+  * Parallel query makes it possible for there to be duplicate PIDs in the
+  * result (either because multiple waiters are blocked by same PID, or
+  * because multiple blockers have same group leader PID).  We do not bother
+  * to eliminate such duplicates from the result.
+  *
+  * We need not consider predicate locks here, since those don't block anything.
+  */
+ Datum
+ pg_blocking_pids(PG_FUNCTION_ARGS)
+ {
+     int            blocked_pid = PG_GETARG_INT32(0);
+     Datum       *arrayelems;
+     int            narrayelems;
+     BlockedProcsData *lockData; /* state data from lmgr */
+     int            i,
+                 j;
+
+     /* Collect a snapshot of lock manager state */
+     lockData = GetBlockerStatusData(blocked_pid);
+
+     /* We can't need more output entries than there are reported PROCLOCKs */
+     arrayelems = (Datum *) palloc(lockData->nlocks * sizeof(Datum));
+     narrayelems = 0;
+
+     /* For each blocked proc in the lock group ... */
+     for (i = 0; i < lockData->nprocs; i++)
+     {
+         BlockedProcData *bproc = &lockData->procs[i];
+         LockInstanceData *instances = &lockData->locks[bproc->first_lock];
+         int           *preceding_waiters = &lockData->waiter_pids[bproc->first_waiter];
+         LockInstanceData *blocked_instance;
+         LockMethod    lockMethodTable;
+         int            conflictMask;
+
+         /*
+          * Locate the blocked proc's own entry in the LockInstanceData array.
+          * There should be exactly one matching entry.
+          */
+         blocked_instance = NULL;
+         for (j = 0; j < bproc->num_locks; j++)
+         {
+             LockInstanceData *instance = &(instances[j]);
+
+             if (instance->pid == bproc->pid)
+             {
+                 Assert(blocked_instance == NULL);
+                 blocked_instance = instance;
+             }
+         }
+         Assert(blocked_instance != NULL);
+
+         lockMethodTable = GetLockTagsMethodTable(&(blocked_instance->locktag));
+         conflictMask = lockMethodTable->conflictTab[blocked_instance->waitLockMode];
+
+         /* Now scan the PROCLOCK data for conflicting procs */
+         for (j = 0; j < bproc->num_locks; j++)
+         {
+             LockInstanceData *instance = &(instances[j]);
+
+             /* A proc never blocks itself, so ignore that entry */
+             if (instance == blocked_instance)
+                 continue;
+             /* Members of same lock group never block each other, either */
+             if (instance->leaderPid == blocked_instance->leaderPid)
+                 continue;
+
+             if (conflictMask & instance->holdMask)
+             {
+                 /* hard block: blocked by lock already held by this entry */
+             }
+             else if (instance->waitLockMode != NoLock &&
+                      (conflictMask & LOCKBIT_ON(instance->waitLockMode)))
+             {
+                 /* conflict in lock requests; who's in front in wait queue? */
+                 bool        ahead = false;
+                 int            k;
+
+                 for (k = 0; k < bproc->num_waiters; k++)
+                 {
+                     if (preceding_waiters[k] == instance->pid)
+                     {
+                         /* soft block: this entry is ahead of blocked proc */
+                         ahead = true;
+                         break;
+                     }
+                 }
+                 if (!ahead)
+                     continue;    /* not blocked by this entry */
+             }
+             else
+             {
+                 /* not blocked by this entry */
+                 continue;
+             }
+
+             /* blocked by this entry, so emit a record */
+             arrayelems[narrayelems++] = Int32GetDatum(instance->leaderPid);
+         }
+     }
+
+     /* Assert we didn't overrun arrayelems[] */
+     Assert(narrayelems <= lockData->nlocks);
+
+     /* Construct array, using hardwired knowledge about int4 type */
+     PG_RETURN_ARRAYTYPE_P(construct_array(arrayelems, narrayelems,
+                                           INT4OID,
+                                           sizeof(int32), true, 'i'));
+ }
+
+
+ /*
   * Functions for manipulating advisory locks
   *
   * We make use of the locktag fields as follows:
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 59c50d9..62b9125 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 3329 (  pg_show_all_fi
*** 3012,3017 ****
--- 3012,3019 ----
  DESCR("show config file settings");
  DATA(insert OID = 1371 (  pg_lock_status   PGNSP PGUID 12 1 1000 0 0 f f f f t t v s 0 0 2249 "" "{25,26,26,23,21,25,28,26,26,21,25,23,25,16,16}" "{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{locktype,database,relation,page,tuple,virtualxid,transactionid,classid,objid,objsubid,virtualtransaction,pid,mode,granted,fastpath}" _null_ _null_ pg_lock_status _null_ _null_ _null_ ));
  DESCR("view system lock information");
+ DATA(insert OID = 2561 (  pg_blocking_pids PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 1007 "23" _null_ _null_ _null_ _null_ _null_ pg_blocking_pids _null_ _null_ _null_ ));
+ DESCR("get array of PIDs of sessions blocking specified backend PID");
  DATA(insert OID = 1065 (  pg_prepared_xact PGNSP PGUID 12 1 1000 0 0 f f f f t t v s 0 0 2249 "" "{28,25,1184,26,26}" "{o,o,o,o,o}" "{transaction,gid,prepared,ownerid,dbid}" _null_ _null_ pg_prepared_xact _null_ _null_ _null_ ));
  DESCR("view two-phase transactions");
  DATA(insert OID = 3819 (  pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v s 1 0 2249 "28" "{28,28,25}" "{i,o,o}" "{multixid,xid,mode}" _null_ _null_ pg_get_multixact_members _null_ _null_ _null_ ));
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 703eaf2..788d50a 100644
*** a/src/include/storage/lock.h
--- b/src/include/storage/lock.h
*************** typedef struct PROCLOCK
*** 346,352 ****
      PROCLOCKTAG tag;            /* unique identifier of proclock object */

      /* data */
!     PGPROC       *groupLeader;    /* group leader, or NULL if no lock group */
      LOCKMASK    holdMask;        /* bitmask for lock types currently held */
      LOCKMASK    releaseMask;    /* bitmask for lock types to be released */
      SHM_QUEUE    lockLink;        /* list link in LOCK's list of proclocks */
--- 346,352 ----
      PROCLOCKTAG tag;            /* unique identifier of proclock object */

      /* data */
!     PGPROC       *groupLeader;    /* proc's lock group leader, or proc itself */
      LOCKMASK    holdMask;        /* bitmask for lock types currently held */
      LOCKMASK    releaseMask;    /* bitmask for lock types to be released */
      SHM_QUEUE    lockLink;        /* list link in LOCK's list of proclocks */
*************** typedef struct LOCALLOCK
*** 423,443 ****

  typedef struct LockInstanceData
  {
!     LOCKTAG        locktag;        /* locked object */
      LOCKMASK    holdMask;        /* locks held by this PGPROC */
      LOCKMODE    waitLockMode;    /* lock awaited by this PGPROC, if any */
      BackendId    backend;        /* backend ID of this PGPROC */
      LocalTransactionId lxid;    /* local transaction ID of this PGPROC */
      int            pid;            /* pid of this PGPROC */
      bool        fastpath;        /* taken via fastpath? */
  } LockInstanceData;

  typedef struct LockData
  {
      int            nelements;        /* The length of the array */
!     LockInstanceData *locks;
  } LockData;


  /* Result codes for LockAcquire() */
  typedef enum
--- 423,470 ----

  typedef struct LockInstanceData
  {
!     LOCKTAG        locktag;        /* tag for locked object */
      LOCKMASK    holdMask;        /* locks held by this PGPROC */
      LOCKMODE    waitLockMode;    /* lock awaited by this PGPROC, if any */
      BackendId    backend;        /* backend ID of this PGPROC */
      LocalTransactionId lxid;    /* local transaction ID of this PGPROC */
      int            pid;            /* pid of this PGPROC */
+     int            leaderPid;        /* pid of group leader; = pid if no group */
      bool        fastpath;        /* taken via fastpath? */
  } LockInstanceData;

  typedef struct LockData
  {
      int            nelements;        /* The length of the array */
!     LockInstanceData *locks;    /* Array of per-PROCLOCK information */
  } LockData;

+ typedef struct BlockedProcData
+ {
+     int            pid;            /* pid of a blocked PGPROC */
+     /* Per-PROCLOCK information about PROCLOCKs of the lock the pid awaits */
+     /* (these fields refer to indexes in BlockedProcsData.locks[]) */
+     int            first_lock;        /* index of first relevant LockInstanceData */
+     int            num_locks;        /* number of relevant LockInstanceDatas */
+     /* PIDs of PGPROCs that are ahead of "pid" in the lock's wait queue */
+     /* (these fields refer to indexes in BlockedProcsData.waiter_pids[]) */
+     int            first_waiter;    /* index of first preceding waiter */
+     int            num_waiters;    /* number of preceding waiters */
+ } BlockedProcData;
+
+ typedef struct BlockedProcsData
+ {
+     BlockedProcData *procs;        /* Array of per-blocked-proc information */
+     LockInstanceData *locks;    /* Array of per-PROCLOCK information */
+     int           *waiter_pids;    /* Array of PIDs of other blocked PGPROCs */
+     int            nprocs;            /* # of valid entries in procs[] array */
+     int            maxprocs;        /* Allocated length of procs[] array */
+     int            nlocks;            /* # of valid entries in locks[] array */
+     int            maxlocks;        /* Allocated length of locks[] array */
+     int            npids;            /* # of valid entries in waiter_pids[] array */
+     int            maxpids;        /* Allocated length of waiter_pids[] array */
+ } BlockedProcsData;
+

  /* Result codes for LockAcquire() */
  typedef enum
*************** typedef enum
*** 489,494 ****
--- 516,522 ----
   */
  extern void InitLocks(void);
  extern LockMethod GetLocksMethodTable(const LOCK *lock);
+ extern LockMethod GetLockTagsMethodTable(const LOCKTAG *locktag);
  extern uint32 LockTagHashCode(const LOCKTAG *locktag);
  extern bool DoLockModesConflict(LOCKMODE mode1, LOCKMODE mode2);
  extern LockAcquireResult LockAcquire(const LOCKTAG *locktag,
*************** extern void GrantAwaitedLock(void);
*** 521,526 ****
--- 549,555 ----
  extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode);
  extern Size LockShmemSize(void);
  extern LockData *GetLockStatusData(void);
+ extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);

  extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks);
  extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode);
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 1fbf4f3..dd37c0c 100644
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
*************** extern VirtualTransactionId *GetVirtualX
*** 61,66 ****
--- 61,67 ----
  extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);

  extern PGPROC *BackendPidGetProc(int pid);
+ extern PGPROC *BackendPidGetProcWithLock(int pid);
  extern int    BackendXidGetPid(TransactionId xid);
  extern bool IsBackendPid(int pid);

diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 94c1881..7ec93c9 100644
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
*************** extern Datum row_security_active_name(PG
*** 1157,1162 ****
--- 1157,1163 ----

  /* lockfuncs.c */
  extern Datum pg_lock_status(PG_FUNCTION_ARGS);
+ extern Datum pg_blocking_pids(PG_FUNCTION_ARGS);
  extern Datum pg_advisory_lock_int8(PG_FUNCTION_ARGS);
  extern Datum pg_advisory_xact_lock_int8(PG_FUNCTION_ARGS);
  extern Datum pg_advisory_lock_shared_int8(PG_FUNCTION_ARGS);
diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index 0a9d25c..6461ae8 100644
*** a/src/test/isolation/isolationtester.c
--- b/src/test/isolation/isolationtester.c
*************** main(int argc, char **argv)
*** 227,253 ****
       */
      initPQExpBuffer(&wait_query);
      appendPQExpBufferStr(&wait_query,
!                          "SELECT 1 FROM pg_locks holder, pg_locks waiter "
!                          "WHERE NOT waiter.granted AND waiter.pid = $1 "
!                          "AND holder.granted "
!                          "AND holder.pid <> $1 AND holder.pid IN (");
      /* The spec syntax requires at least one session; assume that here. */
      appendPQExpBufferStr(&wait_query, backend_pids[1]);
      for (i = 2; i < nconns; i++)
!         appendPQExpBuffer(&wait_query, ", %s", backend_pids[i]);
!     appendPQExpBufferStr(&wait_query,
!                          ") "
!
!                   "AND holder.locktype IS NOT DISTINCT FROM waiter.locktype "
!                   "AND holder.database IS NOT DISTINCT FROM waiter.database "
!                   "AND holder.relation IS NOT DISTINCT FROM waiter.relation "
!                          "AND holder.page IS NOT DISTINCT FROM waiter.page "
!                          "AND holder.tuple IS NOT DISTINCT FROM waiter.tuple "
!               "AND holder.virtualxid IS NOT DISTINCT FROM waiter.virtualxid "
!         "AND holder.transactionid IS NOT DISTINCT FROM waiter.transactionid "
!                     "AND holder.classid IS NOT DISTINCT FROM waiter.classid "
!                          "AND holder.objid IS NOT DISTINCT FROM waiter.objid "
!                 "AND holder.objsubid IS NOT DISTINCT FROM waiter.objsubid ");

      res = PQprepare(conns[0], PREP_WAITING, wait_query.data, 0, NULL);
      if (PQresultStatus(res) != PGRES_COMMAND_OK)
--- 227,238 ----
       */
      initPQExpBuffer(&wait_query);
      appendPQExpBufferStr(&wait_query,
!                          "SELECT pg_catalog.pg_blocking_pids($1) && '{");
      /* The spec syntax requires at least one session; assume that here. */
      appendPQExpBufferStr(&wait_query, backend_pids[1]);
      for (i = 2; i < nconns; i++)
!         appendPQExpBuffer(&wait_query, ",%s", backend_pids[i]);
!     appendPQExpBufferStr(&wait_query, "}'::integer[]");

      res = PQprepare(conns[0], PREP_WAITING, wait_query.data, 0, NULL);
      if (PQresultStatus(res) != PGRES_COMMAND_OK)
*************** try_complete_step(Step *step, int flags)
*** 745,765 ****
              /* If it's OK for the step to block, check whether it has. */
              if (flags & STEP_NONBLOCK)
              {
!                 int            ntuples;

                  res = PQexecPrepared(conns[0], PREP_WAITING, 1,
                                       &backend_pids[step->session + 1],
                                       NULL, NULL, 0);
!                 if (PQresultStatus(res) != PGRES_TUPLES_OK)
                  {
                      fprintf(stderr, "lock wait query failed: %s",
                              PQerrorMessage(conn));
                      exit_nicely();
                  }
!                 ntuples = PQntuples(res);
                  PQclear(res);

!                 if (ntuples >= 1)        /* waiting to acquire a lock */
                  {
                      if (!(flags & STEP_RETRY))
                          printf("step %s: %s <waiting ...>\n",
--- 730,751 ----
              /* If it's OK for the step to block, check whether it has. */
              if (flags & STEP_NONBLOCK)
              {
!                 bool        waiting;

                  res = PQexecPrepared(conns[0], PREP_WAITING, 1,
                                       &backend_pids[step->session + 1],
                                       NULL, NULL, 0);
!                 if (PQresultStatus(res) != PGRES_TUPLES_OK ||
!                     PQntuples(res) != 1)
                  {
                      fprintf(stderr, "lock wait query failed: %s",
                              PQerrorMessage(conn));
                      exit_nicely();
                  }
!                 waiting = ((PQgetvalue(res, 0, 0))[0] == 't');
                  PQclear(res);

!                 if (waiting)    /* waiting to acquire a lock */
                  {
                      if (!(flags & STEP_RETRY))
                          printf("step %s: %s <waiting ...>\n",

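For reference, the prepared statement the rewritten code builds looks
like this for a hypothetical spec whose sessions have backend PIDs 101,
102, and 103 ($1 is the PID of the step's session):

SELECT pg_catalog.pg_blocking_pids($1) && '{101,102,103}'::integer[];

Here && is the array-overlap operator, so the query yields a single
boolean row that becomes true as soon as any of the spec's sessions
hard- or soft-blocks the step's backend, which is why
try_complete_step() now insists on exactly one returned tuple and just
tests its first character for 't'.
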
Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From
Robert Haas
Date:
On Wed, Feb 17, 2016 at 9:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I just had a rather disturbing thought to the effect that this entire
> design --- ie, parallel workers taking out locks for themselves --- is
> fundamentally flawed.  As far as I can tell from README.parallel,
> parallel workers are supposed to exit (and, presumably, release their
> locks) before the leader's transaction commits.  Releasing locks before
> commit is wrong.  Do I need to rehearse why?

No, you don't.  I've spent a good deal of time thinking about that problem.

In typical cases, workers are going to be acquiring either catalog
locks (which are released before commit) or locks on relations which
the leader has already locked (in which case the leader will still
hold the lock - or possibly a stronger one - even after the worker
releases that lock).  Suppose, however, that you write a function
which goes and queries some other table not involved in the query, and
therefore acquires a lock on it.  If you mark that function PARALLEL
SAFE and it runs only in the worker and not in the leader, then you
could end up with a parallel query that releases the lock before
commit where a non-parallel version of that query would have held the
lock until transaction commit.  Of course, one answer to this problem
is - if the early lock release is apt to be a problem for you - don't
mark such functions PARALLEL SAFE.
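
A hypothetical sketch of that hazard (the function and table names are
illustrative):

CREATE FUNCTION peek_other_table() RETURNS bigint
    LANGUAGE sql PARALLEL SAFE
AS 'SELECT count(*) FROM other_table';

SELECT peek_other_table() FROM big_table;

If peek_other_table() happens to run only in a worker, the worker's
AccessShareLock on other_table is released when that worker exits,
before the leader's transaction commits; a non-parallel execution of
the same query would hold the lock until commit.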

I've thought about engineering a better solution.  Two possible
designs come to mind.  First, we could have the worker send to the
leader a list of locks that it holds at the end of its work, and the
leader could acquire all of those before confirming to the worker that
it is OK to terminate.  That has some noteworthy disadvantages, like
being prone to deadlock and requiring workers to stick around
potentially quite a bit longer than they do at present, thus limiting
the ability of other processes to use parallel query.  Second, we
could have the workers reassign all of their locks to the leader in
the lock table (unless the leader already holds that lock).  The
problem with that is that then the leader is in the weird situation of
having locks in the shared lock table that it doesn't know anything
about - they don't appear in it's local lock table.  How does the
leader decide which resource owner they go with?

Unless I'm missing something, though, this is a fairly obscure
problem.  Early release of catalog locks is desirable, and locks on
scanned tables should be the same locks (or weaker) than already held
by the master.  Other cases are rare.  I think.  It would be good to
know if you think otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From
Robert Haas
Date:
On Mon, Feb 22, 2016 at 2:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> As for the patch itself, I'm having trouble grokking what it's trying
>>> to do.  I think it might be worth having a comment defining precisely
>>> what we mean by "A blocks B".  I would define "A blocks B" in general
>>> as either A holds a lock which conflicts with one sought by B
>>> (hard-blocked) or A awaits a lock which conflicts with one sought by B
>>> and precedes it in the wait queue (soft-blocked).
>
>> Yes, that is exactly what I implemented ... and it's something you can't
>> find out from pg_locks.  I'm not sure how that view could be made to
>> expose wait-queue ordering.
>
> Here's an updated version of this patch, now with user-facing docs.
>
> I decided that "pg_blocking_pids()" is a better function name than
> "pg_blocker_pids()".  The code's otherwise the same, although I
> revisited some of the comments.
>
> I also changed quite a few references to "transaction" into "process"
> in the discussion of pg_locks.  The previous choice to conflate
> processes with transactions was never terribly wise in my view, and
> it's certainly completely broken by parallel query.

!    held by the indicated process.  False indicates that this process is
!    currently waiting to acquire this lock, which implies that at least one other
!    process is holding a conflicting lock mode on the same lockable object.

I know you're just updating existing language here, but this is false.
It only implies that one other process is holding *or waiting for* a
conflicting lock mode on the same lockable object.  Other than that, I
think the documentation changes look good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Feb 17, 2016 at 9:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I just had a rather disturbing thought to the effect that this entire
>> design --- ie, parallel workers taking out locks for themselves --- is
>> fundamentally flawed.  As far as I can tell from README.parallel,
>> parallel workers are supposed to exit (and, presumably, release their
>> locks) before the leader's transaction commits.  Releasing locks before
>> commit is wrong.  Do I need to rehearse why?

> No, you don't.  I've spent a good deal of time thinking about that problem.
> [ much snipped ]
> Unless I'm missing something, though, this is a fairly obscure
> problem.  Early release of catalog locks is desirable, and locks on
> scanned tables should be the same locks as (or weaker than) those already held
> by the master.  Other cases are rare.  I think.  It would be good to
> know if you think otherwise.

After further thought, what I think about this is that it's safe so long
as parallel workers are strictly read-only.  Given that, early lock
release after user table access is okay for the same reasons it's okay
after catalog accesses.  However, this is one of the big problems that
we'd have to have a solution for before we ever consider allowing
read-write parallelism.

So what distresses me about the current situation is that this is a large
stumbling block that I don't see documented anywhere.  It'd be good if you
transcribed some of this conversation into README.parallel.

(BTW, I don't believe your statement that transferring locks back to the
master would be deadlock-prone.  If the lock system treats locks held by
a lock group as effectively all belonging to one entity, then granting a
lock identical to one already held by another group member should never
fail.  I concur that it might be expensive performance-wise, though it
hardly seems like this would be a dominant factor compared to all the
other costs of setting up and tearing down parallel workers.)
        regards, tom lane



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Feb 22, 2016 at 2:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> !    held by the indicated process.  False indicates that this process is
> !    currently waiting to acquire this lock, which implies that at least one other
> !    process is holding a conflicting lock mode on the same lockable object.

> I know you're just updating existing language here, but this is false.
> It only implies that one other process is holding *or waiting for* a
> conflicting lock mode on the same lockable object.

True.  I had considered whether to fix that point as well, and decided
that it might just be overcomplicating matters.  But since you complain,
I'll add "or waiting for".

It also occurred to me last night that pg_blocking_pids() needs a
disclaimer similar to the existing one for pg_locks about how using it
a lot could put a performance drag on the system.

Other than adjusting those points, I think this is ready to go, and
will commit later today if I hear no objections.
        regards, tom lane



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, Feb 17, 2016 at 9:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I just had a rather disturbing thought to the effect that this entire
> >> design --- ie, parallel workers taking out locks for themselves --- is
> >> fundamentally flawed.  As far as I can tell from README.parallel,
> >> parallel workers are supposed to exit (and, presumably, release their
> >> locks) before the leader's transaction commits.  Releasing locks before
> >> commit is wrong.  Do I need to rehearse why?
>
> > No, you don't.  I've spent a good deal of time thinking about that problem.
> > [ much snipped ]
> > Unless I'm missing something, though, this is a fairly obscure
> > problem.  Early release of catalog locks is desirable, and locks on
> > scanned tables should be the same locks as (or weaker than) those already held
> > by the master.  Other cases are rare.  I think.  It would be good to
> > know if you think otherwise.
>
> After further thought, what I think about this is that it's safe so long
> as parallel workers are strictly read-only.  Given that, early lock
> release after user table access is okay for the same reasons it's okay
> after catalog accesses.  However, this is one of the big problems that
> we'd have to have a solution for before we ever consider allowing
> read-write parallelism.

Having such a blocker for read-write parallelism would be unfortunate,
though perhaps there isn't much help for it, and having read-only query
parallelism is certainly far better than nothing.

> So what distresses me about the current situation is that this is a large
> stumbling block that I don't see documented anywhere.  It'd be good if you
> transcribed some of this conversation into README.parallel.

Agreed.

> (BTW, I don't believe your statement that transferring locks back to the
> master would be deadlock-prone.  If the lock system treats locks held by
> a lock group as effectively all belonging to one entity, then granting a
> lock identical to one already held by another group member should never
> fail.  I concur that it might be expensive performance-wise, though it
> hardly seems like this would be a dominant factor compared to all the
> other costs of setting up and tearing down parallel workers.)

This is only when a parallel worker is finished, no?  Isn't there
already some indication of when a parallel worker is done that the
master handles, where it could also check the shared lock table and see
if any locks were transferred to it on worker exit?

Only following this thread from afar, so take my suggestions with a
grain of salt.

Thanks!

Stephen

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> ... However, this is one of the big problems that
>> we'd have to have a solution for before we ever consider allowing
>> read-write parallelism.

> Having such a blocker for read-write parallelism would be unfortunate,
> though perhaps there isn't much help for it, and having read-only query
> parallelism is certainly far better than nothing.

IIUC, this is not the only large problem standing between us and
read-write parallelism, and probably not even the biggest one.
So I'm basically just asking that it get documented while it's
fresh in mind, rather than leaving it for some future hackers to
rediscover the hard way.  (It wouldn't be bad to document the other
known stumbling blocks, too.)
        regards, tom lane



Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From
Robert Haas
Date:
On Mon, Feb 22, 2016 at 7:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> No, you don't.  I've spent a good deal of time thinking about that problem.
>> [ much snipped ]
>> Unless I'm missing something, though, this is a fairly obscure
>> problem.  Early release of catalog locks is desirable, and locks on
>> scanned tables should be the same as (or weaker than) those already
>> held by the master.  Other cases are rare, I think.  It would be good to
>> know if you think otherwise.
>
> After further thought, what I think about this is that it's safe so long
> as parallel workers are strictly read-only.  Given that, early lock
> release after user table access is okay for the same reasons it's okay
> after catalog accesses.  However, this is one of the big problems that
> we'd have to have a solution for before we ever consider allowing
> read-write parallelism.

Actually, I don't quite see what read-only vs. read-write queries have
to do with this particular issue.  We retain relation locks on target
relations until commit, regardless of whether those locks are
AccessShareLock, RowShareLock, or RowExclusiveLock.  As far as I
understand it, this isn't because anything would fail horribly if we
released those locks at the end of the query, but rather because we think that
releasing those locks early might result in unpleasant surprises for
client applications.  I'm actually not really convinced that's true: I
will grant that it might be surprising to run the same query twice in
the same transaction and get different tuple descriptors, but it might
also be surprising to get different rows, which READ COMMITTED allows
anyway.  And I've met a few users who were pretty surprised to find
out that they couldn't do DDL on table A even though the blocking
session mentioned table A nowhere in its currently-executing query.

The main issues with allowing read-write parallelism that I know of
off-hand are:

* Updates or deletes might create new combo CIDs.  In order to handle
that, we'd need to store the combo CID mapping in some sort of
DSM-based data structure which could expand as new combo CIDs were
generated.  (A minimal sketch of such a map appears just after this
list.)

* Relation extension locks, and a few other types of heavyweight
locks, are used for mutual exclusion of operations that would need to
be serialized even among cooperating backends.  So group locking would
need to be enhanced to handle those cases differently, or some other
solution would need to be found.  (I've done some more detailed
analysis of possible solutions, most of which has been posted to
-hackers in various emails at one time or another; I'll refrain from
diving into all the details in this missive.  The second sketch after
this list illustrates the basic group-locking rule.)
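
To make the combo-CID point concrete, here is a minimal, self-contained
sketch of an expandable (cmin, cmax) -> combo CID map of the general
shape that would have to live in DSM.  Every name in it (ComboCidEntry,
combo_cid_lookup_or_assign, and so on) is invented for illustration;
this is not PostgreSQL's actual combocid.c code, which is backend-local
and keeps a hash table for reverse lookups.

/*
 * Minimal sketch of an expandable combo-CID map, as it might look if it
 * had to live in a DSM segment shared by a lock group.  All names here
 * are hypothetical.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint32_t CommandId;

typedef struct ComboCidEntry
{
    CommandId cmin;             /* command that inserted the tuple */
    CommandId cmax;             /* command that deleted it */
} ComboCidEntry;

typedef struct ComboCidMap
{
    ComboCidEntry *entries;     /* in real life: space inside a DSM segment */
    int            used;
    int            allocated;
} ComboCidMap;

/* Return the existing combo CID for (cmin, cmax), or assign a new one. */
static CommandId
combo_cid_lookup_or_assign(ComboCidMap *map, CommandId cmin, CommandId cmax)
{
    /* Linear scan for brevity; a real implementation would keep a hash. */
    for (int i = 0; i < map->used; i++)
        if (map->entries[i].cmin == cmin && map->entries[i].cmax == cmax)
            return (CommandId) i;

    if (map->used == map->allocated)
    {
        /*
         * This is the step that makes read-write parallelism hard: with
         * workers attached, growing the map would mean allocating a new,
         * larger chunk of shared memory and publishing it to every
         * process in the group, not just calling realloc().
         */
        int            newsize = map->allocated ? map->allocated * 2 : 16;
        ComboCidEntry *p = realloc(map->entries,
                                   newsize * sizeof(ComboCidEntry));

        if (p == NULL)
            abort();
        map->entries = p;
        map->allocated = newsize;
    }
    map->entries[map->used].cmin = cmin;
    map->entries[map->used].cmax = cmax;
    return (CommandId) map->used++;
}

int
main(void)
{
    ComboCidMap map = { NULL, 0, 0 };
    CommandId   a = combo_cid_lookup_or_assign(&map, 1, 4);
    CommandId   b = combo_cid_lookup_or_assign(&map, 2, 4);
    CommandId   c = combo_cid_lookup_or_assign(&map, 1, 4);

    /* prints a=0 b=1 c=0: the repeated pair reuses its combo CID */
    printf("a=%u b=%u c=%u\n", (unsigned) a, (unsigned) b, (unsigned) c);
    free(map.entries);
    return 0;
}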
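
And for the group-locking point, a schematic sketch of the rule being
described: members of one lock group are treated as a single entity for
conflict purposes, except for lock types (such as relation extension)
that must exclude even cooperating backends.  The lock modes, the
conflict rule, and all the names here are simplified stand-ins, not
lock.c's actual logic.

/*
 * Schematic sketch of a group-aware lock conflict check; everything
 * here is a simplified stand-in for lock.c, not the real thing.
 */
#include <stdbool.h>
#include <stdio.h>

typedef enum LockMode { ACCESS_SHARE, ROW_EXCLUSIVE, ACCESS_EXCLUSIVE } LockMode;

typedef struct LockRequest
{
    int      group_id;        /* e.g. the leader's PID, shared by all members */
    LockMode mode;
    bool     excl_in_group;   /* property of the lock's type: true for locks,
                               * such as relation extension, that must exclude
                               * even cooperating backends */
} LockRequest;

/* Simplified conflict matrix: only ACCESS_EXCLUSIVE conflicts with anything. */
static bool
modes_conflict(LockMode a, LockMode b)
{
    return a == ACCESS_EXCLUSIVE || b == ACCESS_EXCLUSIVE;
}

/*
 * Would a request 'req' block on existing holder 'held'?  Same-group
 * holders are treated as one entity -- so a lock identical to one already
 * held by another group member is always grantable -- except for lock
 * types that serialize work even within a group.
 */
static bool
request_blocks_on(const LockRequest *req, const LockRequest *held)
{
    if (!modes_conflict(req->mode, held->mode))
        return false;
    if (req->group_id == held->group_id && !req->excl_in_group)
        return false;
    return true;
}

int
main(void)
{
    LockRequest leader_rel = { .group_id = 100, .mode = ACCESS_EXCLUSIVE,
                               .excl_in_group = false };
    LockRequest worker_rel = { .group_id = 100, .mode = ACCESS_EXCLUSIVE,
                               .excl_in_group = false };
    LockRequest leader_ext = { .group_id = 100, .mode = ACCESS_EXCLUSIVE,
                               .excl_in_group = true };
    LockRequest worker_ext = { .group_id = 100, .mode = ACCESS_EXCLUSIVE,
                               .excl_in_group = true };
    LockRequest other_rel  = { .group_id = 200, .mode = ACCESS_EXCLUSIVE,
                               .excl_in_group = false };

    printf("worker vs leader, same relation:  %s\n",
           request_blocks_on(&worker_rel, &leader_rel) ? "blocks" : "grants");
    printf("worker vs leader, extension lock: %s\n",
           request_blocks_on(&worker_ext, &leader_ext) ? "blocks" : "grants");
    printf("other backend vs leader:          %s\n",
           request_blocks_on(&other_rel, &leader_rel) ? "blocks" : "grants");
    return 0;
}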

But those are separate from the question of whether parallel workers
need to transfer any heavyweight locks they accumulate on non-scanned
tables back to the leader.

> So what distresses me about the current situation is that this is a large
> stumbling block that I don't see documented anywhere.  It'd be good if you
> transcribed some of this conversation into README.parallel.
>
> (BTW, I don't believe your statement that transferring locks back to the
> master would be deadlock-prone.  If the lock system treats locks held by
> a lock group as effectively all belonging to one entity, then granting a
> lock identical to one already held by another group member should never
> fail.  I concur that it might be expensive performance-wise, though it
> hardly seems like this would be a dominant factor compared to all the
> other costs of setting up and tearing down parallel workers.)

I don't mean that the heavyweight lock acquisition itself would fail;
I agree with your analysis on that.  I mean that you'd have to design
the protocol for the leader and the worker to communicate very
carefully in order for it not to get stuck.  Right now, the leader
initializes the DSM at startup, before any workers are running, with
all the data the workers will need, and after that data flows strictly
from workers to leader.  So the workers could send control messages
indicating heavyweight locks that they held to the leader, and that
would be fine.  Then the leader would need to read those messages and
do something with them, after which it would need to tell the workers
that they could now exit.  You'd need to make sure there was no
situation in which that handshake could get stuck, for example
because the leader was waiting for a tuple from the worker while the
worker was waiting for a lock-acquisition confirmation from the
leader.  That particular thing is probably not an issue, but hopefully
it illustrates the sort of hazard I'm concerned about.
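
To make the shape of that hazard concrete, here is a toy single-process
model of such a shutdown handshake.  The queue, the message kinds, and
every name are invented for illustration; a real mechanism would sit on
top of PostgreSQL's shared-memory message queues, which this doesn't
attempt to model.

/*
 * Toy model of the leader/worker shutdown handshake discussed above.
 * The safe rule it demonstrates: while the handshake is in flight, the
 * leader must be willing to consume *any* message kind.  A leader that
 * selectively waits for the lock-transfer message without draining
 * tuples -- or a worker that waits for a confirmation before sending
 * more tuples -- is the kind of arrangement that can wedge.
 */
#include <stdbool.h>
#include <stdio.h>

typedef enum MsgKind { MSG_TUPLE, MSG_LOCK_TRANSFER, MSG_DONE } MsgKind;

#define QLEN 8

typedef struct Queue
{
    MsgKind msgs[QLEN];
    int     head, tail;
} Queue;

static bool
q_put(Queue *q, MsgKind m)
{
    if (q->tail - q->head == QLEN)
        return false;           /* queue full: sender would block here */
    q->msgs[q->tail++ % QLEN] = m;
    return true;
}

static bool
q_get(Queue *q, MsgKind *m)
{
    if (q->head == q->tail)
        return false;           /* queue empty: receiver would block here */
    *m = q->msgs[q->head++ % QLEN];
    return true;
}

int
main(void)
{
    Queue worker_to_leader = {0};

    /* The worker finishes its scan: some tuples, then its held locks. */
    q_put(&worker_to_leader, MSG_TUPLE);
    q_put(&worker_to_leader, MSG_TUPLE);
    q_put(&worker_to_leader, MSG_LOCK_TRANSFER);
    q_put(&worker_to_leader, MSG_DONE);

    /* Safe leader loop: drain everything before replying "you may exit". */
    MsgKind m;
    int     tuples = 0, locks = 0;

    while (q_get(&worker_to_leader, &m) && m != MSG_DONE)
    {
        if (m == MSG_TUPLE)
            tuples++;           /* process the tuple */
        else if (m == MSG_LOCK_TRANSFER)
            locks++;            /* adopt the worker's lock */
    }
    printf("leader consumed %d tuples, adopted %d locks; "
           "now safe to tell the worker to exit\n", tuples, locks);
    return 0;
}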

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company