Discussion: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.
On Thu, Feb 11, 2016 at 9:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <rhaas@postgresql.org> writes:
>> Add some isolation tests for deadlock detection and resolution.
>
> Buildfarm says this needs work ...
>
> dromedary is one of mine, do you need me to look into what is
> happening?

That would be great. Taking a look at what happened, I have a feeling this may be a race condition of some kind in the isolation tester. It seems to have failed to recognize that a1 started waiting, and that caused the "deadlock detected" message to be reported differently. I'm not immediately sure what to do about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 11, 2016 at 9:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 11, 2016 at 9:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <rhaas@postgresql.org> writes:
>>> Add some isolation tests for deadlock detection and resolution.
>>
>> Buildfarm says this needs work ...
>>
>> dromedary is one of mine, do you need me to look into what is
>> happening?
>
> That would be great. Taking a look at what happened, I have a feeling
> this may be a race condition of some kind in the isolation tester. It
> seems to have failed to recognize that a1 started waiting, and that
> caused the "deadlock detected" message to be reported differently. I'm
> not immediately sure what to do about that.

Yeah, so: try_complete_step() waits 10ms, and if it still hasn't gotten any data back from the server, it uses a separate query to check whether the step in question is waiting on a lock. So what must've happened here is that it took more than 10ms for the process to show up as waiting in pg_stat_activity.

It might be possible to fix this by not passing STEP_NONBLOCK if there's only one connection that isn't waiting. I think I had it like that at one point, and then took it out because it caused some other problem. Another option is to lengthen the timeout. It doesn't seem great to be dependent on a fixed timeout, but the server doesn't send any protocol traffic to indicate a lock wait. If we declared which steps are supposed to wait, then there'd be no ambiguity, but that seems like a drag.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
>> That would be great. Taking a look at what happened, I have a feeling
>> this may be a race condition of some kind in the isolation tester. It
>> seems to have failed to recognize that a1 started waiting, and that
>> caused the "deadlock detected" message to be reported differently. I'm
>> not immediately sure what to do about that.

> Yeah, so: try_complete_step() waits 10ms, and if it still hasn't
> gotten any data back from the server, then it uses a separate query to
> see whether the step in question is waiting on a lock. So what
> must've happened here is that it took more than 10ms for the process
> to show up as waiting in pg_stat_activity.

No, because the machines that are failing are showing a "<waiting ...>" annotation that your reference output *doesn't* have. I think what is actually happening is that these machines are seeing the process as waiting and reporting it, whereas on your machine the backend detects the deadlock and completes the query (with an error) before isolationtester realizes that the process is waiting.

It would probably help if you didn't do this:

setup
{
  BEGIN;
  SET deadlock_timeout = '10ms';
}

which pretty much guarantees that there is a race condition: you've set it so that the deadlock detector will run at approximately the same time when isolationtester will be probing the state. I'm surprised that it seemed to act consistently for you.

I would suggest setting all the other sessions' deadlock_timeout to 100s, and the one you want to fail to ~5s. That will mean that the "<waiting ...>" output should show up pretty reliably even on overloaded buildfarm critters.

			regards, tom lane
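[A minimal sketch of the arrangement Tom suggests, in isolation-spec form. The session names, table names, and permutation here are hypothetical, not taken from the actual test: the session expected to lose the deadlock gets the short deadlock_timeout, and every other session gets one far too large to ever fire, so isolationtester has ample time to observe both sessions as waiting.]

    # sketch only --- not the actual deadlock-hard spec
    session "s1"
    setup       { BEGIN; SET deadlock_timeout = '100s'; }
    step "s1a1" { LOCK TABLE a1; }
    step "s1a2" { LOCK TABLE a2; }
    teardown    { COMMIT; }

    session "s2"
    setup       { BEGIN; SET deadlock_timeout = '5s'; }
    step "s2a2" { LOCK TABLE a2; }
    step "s2a1" { LOCK TABLE a1; }   # forms the deadlock; s2's detector fires first
    teardown    { COMMIT; }

    permutation "s1a1" "s2a2" "s2a1" "s1a2"

With this ordering, s2 blocks on a1 and its 5s timer starts; s1 then blocks on a2, completing the cycle; s2's detector runs first and s2 is the session that reports "deadlock detected".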
I wrote:
> No, because the machines that are failing are showing a "<waiting ...>"
> annotation that your reference output *doesn't* have. I think what is
> actually happening is that these machines are seeing the process as
> waiting and reporting it, whereas on your machine the backend detects
> the deadlock and completes the query (with an error) before
> isolationtester realizes that the process is waiting.

I confirmed this theory by the expedient of changing the '10ms' setting in the test script to 1ms (which worked) and 100ms (which did not, on the same machine).

I've committed an update that adjusts the timeouts to hopefully ensure that isolationtester always sees the query as blocked before the deadlock detector unblocks it; which seems like the behavior we want to test for, anyway.

			regards, tom lane
On Thu, Feb 11, 2016 at 12:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> No, because the machines that are failing are showing a "<waiting ...>"
>> annotation that your reference output *doesn't* have. I think what is
>> actually happening is that these machines are seeing the process as
>> waiting and reporting it, whereas on your machine the backend detects
>> the deadlock and completes the query (with an error) before
>> isolationtester realizes that the process is waiting.
>
> I confirmed this theory by the expedient of changing the '10ms' setting
> in the test script to 1ms (which worked) and 100ms (which did not, on
> the same machine).
>
> I've committed an update that adjusts the timeouts to hopefully ensure
> that isolationtester always sees the query as blocked before the deadlock
> detector unblocks it; which seems like the behavior we want to test for,
> anyway.

Thanks. I really appreciate it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
We're not out of the woods on this :-( ... jaguarundi, which is the first of the CLOBBER_CACHE_ALWAYS animals to run these tests, didn't like them at all. I think I fixed the deadlock-soft-2 failure, but its take on deadlock-hard is:

*** 17,25 ****
  step s6a7: LOCK TABLE a7; <waiting ...>
  step s7a8: LOCK TABLE a8; <waiting ...>
  step s8a1: LOCK TABLE a1; <waiting ...>
- step s8a1: <... completed>
  step s7a8: <... completed>
! error in steps s8a1 s7a8: ERROR: deadlock detected
  step s8c: COMMIT;
  step s7c: COMMIT;
  step s6a7: <... completed>
--- 17,25 ----
  step s6a7: LOCK TABLE a7; <waiting ...>
  step s7a8: LOCK TABLE a8; <waiting ...>
  step s8a1: LOCK TABLE a1; <waiting ...>
  step s7a8: <... completed>
! step s8a1: <... completed>
! ERROR: deadlock detected
  step s8c: COMMIT;
  step s7c: COMMIT;
  step s6a7: <... completed>

The problem here is that when the deadlock detector kills s8's transaction, s7a8 is also left free to proceed, so there is a race condition as to which query completion will get back to isolationtester first.

One grotty way to handle that would be something like

-step "s7a8"	{ LOCK TABLE a8; }
+step "s7a8"	{ LOCK TABLE a8; SELECT pg_sleep(5); }

Or we could simplify the locking structure enough so that no other transactions are released by the deadlock failure. I do not know exactly what you had in mind to be testing here?

			regards, tom lane
On Thu, Feb 11, 2016 at 11:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> We're not out of the woods on this :-( ... jaguarundi, which is the first
> of the CLOBBER_CACHE_ALWAYS animals to run these tests, didn't like them
> at all. I think I fixed the deadlock-soft-2 failure, but its take on
> deadlock-hard is:
>
> *** 17,25 ****
>   step s6a7: LOCK TABLE a7; <waiting ...>
>   step s7a8: LOCK TABLE a8; <waiting ...>
>   step s8a1: LOCK TABLE a1; <waiting ...>
> - step s8a1: <... completed>
>   step s7a8: <... completed>
> ! error in steps s8a1 s7a8: ERROR: deadlock detected
>   step s8c: COMMIT;
>   step s7c: COMMIT;
>   step s6a7: <... completed>
> --- 17,25 ----
>   step s6a7: LOCK TABLE a7; <waiting ...>
>   step s7a8: LOCK TABLE a8; <waiting ...>
>   step s8a1: LOCK TABLE a1; <waiting ...>
>   step s7a8: <... completed>
> ! step s8a1: <... completed>
> ! ERROR: deadlock detected
>   step s8c: COMMIT;
>   step s7c: COMMIT;
>   step s6a7: <... completed>
>
> The problem here is that when the deadlock detector kills s8's
> transaction, s7a8 is also left free to proceed, so there is a race
> condition as to which query completion will get back to
> isolationtester first.
>
> One grotty way to handle that would be something like
>
> -step "s7a8"	{ LOCK TABLE a8; }
> +step "s7a8"	{ LOCK TABLE a8; SELECT pg_sleep(5); }
>
> Or we could simplify the locking structure enough so that no other
> transactions are released by the deadlock failure. I do not know
> exactly what you had in mind to be testing here?

Basically just that the deadlock actually got detected. That may sound a bit weak, but considering we had no test for it at all before this...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 11, 2016 at 11:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The problem here is that when the deadlock detector kills s8's
>> transaction, s7a8 is also left free to proceed, so there is a race
>> condition as to which query completion will get back to
>> isolationtester first.
>>
>> One grotty way to handle that would be something like
>>
>> -step "s7a8"	{ LOCK TABLE a8; }
>> +step "s7a8"	{ LOCK TABLE a8; SELECT pg_sleep(5); }
>>
>> Or we could simplify the locking structure enough so that no other
>> transactions are released by the deadlock failure. I do not know
>> exactly what you had in mind to be testing here?

> Basically just that the deadlock actually got detected. That may
> sound a bit weak, but considering we had no test for it at all before
> this...

I tried fixing it as shown above, and was dismayed to find out that it didn't work, ie, there was still a difference between the regular output and the results with CLOBBER_CACHE_ALWAYS. In the latter case the printout makes it appear that s7a8 completed before s8a1, which is nonsensical.

Investigation showed that there are a couple of reasons. One, isolationtester's is-it-waiting query takes an insane amount of time under CLOBBER_CACHE_ALWAYS --- over half a second on my reasonably new server. Probing the state of half a dozen blocked sessions thus takes a while. Second, once s8 has been booted out of its transaction, s7 is no longer "blocked" according to isolationtester's definition (it's doing the pg_sleep query instead). Therefore, when we're rechecking all the other blocked steps after detecting that s8 has become blocked, two things happen: enough time elapses for the deadlock detector to fire, and then when we get around to checking s7, we don't see it as blocked and therefore wait until it finishes. So s7a8 is reported first despite the pg_sleep, and would be no matter how large a pg_sleep delay is used. We could possibly fix this by using a deadlock timeout even higher than 5 seconds, but that way madness lies.

Instead, what I propose we do about this is to change isolationtester so that once it's decided that a given step is blocked, it no longer issues the is-it-waiting query for that step; it just assumes that the step should be treated as blocked. So all we need do for "backlogged" steps is check PQisBusy/PQconsumeInput. That both greatly reduces the number of is-it-waiting queries that are needed and avoids any flappy behavior of the answer.

Comments?

			regards, tom lane
I wrote:
> Instead, what I propose we do about this is to change isolationtester
> so that once it's decided that a given step is blocked, it no longer
> issues the is-it-waiting query for that step; it just assumes that the
> step should be treated as blocked. So all we need do for "backlogged"
> steps is check PQisBusy/PQconsumeInput. That both greatly reduces the
> number of is-it-waiting queries that are needed and avoids any flappy
> behavior of the answer.

Hmm, that seemed to work fine here, but the buildfarm is not very happy with it, and on reflection I guess it's just moving the indeterminacy somewhere else. If we check for completion of a given step, and don't wait till it's either completed or known blocked, then we have a race condition that can change the order in which completions are reported.

The fundamental problem, I fear, is that isolationtester is designed around the assumption that only its own actions (query issuances) change the state of the database. Trying to use it to test deadlock detection is problematic because deadlock-breaking changes the DB state asynchronously.

I think what we have to do is revert that change and dumb down deadlock-hard until it produces stable results even on the CLOBBER critters. One thing that'd help is reducing the number of processes involved --- AFAICS, testing an 8-way deadlock is not really any more interesting than testing, say, 4-way, and that would halve the amount of time isolationtester spends figuring out the state.

			regards, tom lane
On Fri, Feb 12, 2016 at 4:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Instead, what I propose we do about this is to change isolationtester
>> so that once it's decided that a given step is blocked, it no longer
>> issues the is-it-waiting query for that step; it just assumes that the
>> step should be treated as blocked. So all we need do for "backlogged"
>> steps is check PQisBusy/PQconsumeInput. That both greatly reduces the
>> number of is-it-waiting queries that are needed and avoids any flappy
>> behavior of the answer.
>
> Hmm, that seemed to work fine here, but the buildfarm is not very happy
> with it, and on reflection I guess it's just moving the indeterminacy
> somewhere else. If we check for completion of a given step, and don't
> wait till it's either completed or known blocked, then we have a race
> condition that can change the order in which completions are reported.
>
> The fundamental problem I fear is that isolationtester is designed around
> the assumption that only its own actions (query issuances) change the
> state of the database. Trying to use it to test deadlock detection is
> problematic because deadlock-breaking changes the DB state asynchronously.
>
> I think what we have to do is revert that change and dumb down
> deadlock-hard until it produces stable results even on the CLOBBER
> critters. One thing that'd help is reducing the number of processes
> involved --- AFAICS, testing an 8-way deadlock is not really any more
> interesting than testing, say, 4-way, and that would halve the amount
> of time isolationtester spends figuring out the state.

Maybe we should introduce a way to declare whether a step is expected to wait or not. I thought about doing that, and the only reason I didn't is because I couldn't figure out a reasonable syntax. But, in many respects, that would actually be better than the current system of having isolationtester try to figure it out itself.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
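[For what it's worth, one hypothetical shape such an annotation could take in the spec grammar --- purely illustrative; no such syntax existed at the time of this thread:]

    # hypothetical syntax: mark steps that are expected to block, so
    # isolationtester need not probe the lock state to discover it
    step "s8a1" { LOCK TABLE a1; } waiting
    step "s8c"  { COMMIT; }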
Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.
From: Andres Freund
On 2016-02-12 13:16:54 -0500, Tom Lane wrote:
> Investigation showed that there are a couple of reasons. One,
> isolationtester's is-it-waiting query takes an insane amount of
> time under CLOBBER_CACHE_ALWAYS --- over half a second on my
> reasonably new server.

I wonder if we shouldn't just expose a 'which pid is process X waiting for' API, implemented serverside. That's generally really useful, and looks like it's actually going to be less complicated than that query... And it's surely going to be faster.

Andres
On Fri, Feb 12, 2016 at 6:22 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-12 13:16:54 -0500, Tom Lane wrote:
>> Investigation showed that there are a couple of reasons. One,
>> isolationtester's is-it-waiting query takes an insane amount of
>> time under CLOBBER_CACHE_ALWAYS --- over half a second on my
>> reasonably new server.
>
> I wonder if we shouldn't just expose a 'which pid is process X waiting
> for' API, implemented serverside. That's generally really useful, and
> looks like it's actually going to be less complicated than that
> query... And it's surely going to be faster.

If PID 12000 and PID 13000 hold AccessShareLock on relation foo, and PID 14000 awaits AccessExclusiveLock on that relation, what does the function return when 14000 is passed as an argument?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 12, 2016 at 6:22 PM, Andres Freund <andres@anarazel.de> wrote:
>> I wonder if we shouldn't just expose a 'which pid is process X waiting
>> for' API, implemented serverside. That's generally really useful, and
>> looks like it's actually going to be less complicated than that
>> query... And it's surely going to be faster.

> If PID 12000 and PID 13000 hold AccessShareLock on relation foo, and
> PID 14000 awaits AccessExclusiveLock on that relation, what does the
> function return when 14000 is passed as an argument?

Yeah. In general, it's not that easy to say that A is waiting specifically on B --- there may be multiple blockers and/or multiple ways it could get released. isolationtester's query is not really correct IMO. It's okay as long as nothing else besides autovacuum is taking locks, but I wouldn't want to try to make it work in a general-purpose environment.

			regards, tom lane
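[Robert's multiple-holder scenario is easy to reproduce by hand; a minimal sketch in three psql sessions, using his PIDs from above and a hypothetical table foo:]

    -- session with PID 12000:
    BEGIN;
    SELECT * FROM foo;    -- acquires AccessShareLock on foo

    -- session with PID 13000:
    BEGIN;
    SELECT * FROM foo;    -- a second AccessShareLock; both are granted

    -- session with PID 14000:
    BEGIN;
    LOCK TABLE foo IN ACCESS EXCLUSIVE MODE;  -- conflicts with both holders; waits

Any answer for 14000 that names a single PID would have to pick arbitrarily between 12000 and 13000; returning the full set of blockers, as the patch downthread does, sidesteps that.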
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 12, 2016 at 4:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The fundamental problem I fear is that isolationtester is designed around
>> the assumption that only its own actions (query issuances) change the
>> state of the database. Trying to use it to test deadlock detection is
>> problematic because deadlock-breaking changes the DB state asynchronously.

> Maybe we should introduce a way to declare whether a step is expected
> to wait or not. I thought about doing that, and the only reason I
> didn't is because I couldn't figure out a reasonable syntax. But, in
> many respects, that would actually be better than the current system
> of having isolationtester try to figure it out itself.

Meh. I'm not sure that would actually help anything. The problem is not so much with figuring out whether a step blocks, as knowing when it's expected to complete. And for sure I don't want to annotate the spec files to the point of saying "this action should cause these other steps to complete".

Actually ... the thing we've been fighting over the past couple days is having to make step completion orders deterministic when in principle they aren't and don't need to be. Maybe the solution is something like the core tests' variant expected files, ugly though those are.

			regards, tom lane
The tests worked fine on faster build animals, right? And the clobber animals are much much slower.... So it seems perfectly sensible that their deadlock timeout would just have to be much much higher to have the same behaviour. I see nothing wrong in just setting deadlock timeout to a minute or more on these machines.

The invariant is just that the deadlock timeout needs enough headroom over the actual time the tester's queries take. If they normally take 1/10th of a second, then why not just set the timeout to 10x however long they take on the clobber animals?

--
Greg
Greg Stark <stark@mit.edu> writes:
> The tests worked fine on faster build animals, right? And the clobber
> animals are much much slower.... So it seems perfectly sensible that their
> deadlock timeout would just have to be much much higher to have the same
> behaviour. I see nothing wrong in just setting deadlock timeout to a minute
> or more on these machines.

We don't have a way to make the isolation tests change behavior depending on how the backend is compiled. So the only actually available fix is to make that test take "a minute or more" for *everybody*.

Aside from that, it's just disturbing that these tests aren't deterministic regardless of machine speed. We don't seem to have a way around that right now, but I wish we did.

			regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> I wonder if we shouldn't just expose a 'which pid is process X waiting
> for' API, implemented serverside. That's generally really useful, and
> looks like it's actually going to be less complicated than that
> query... And it's surely going to be faster.

Attached is a draft patch for a new function that reports the set of PIDs directly blocking a given PID (that is, holding or awaiting conflicting locks on the lockable object it's waiting for).

I replaced isolationtester's pg_locks query with this, and found that it's about 9X faster in a normal build, and 3X faster with CLOBBER_CACHE_ALWAYS turned on. That would give us some nice headroom for the isolation tests with CLOBBER_CACHE_ALWAYS animals. (Note that in view of http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2016-02-14%2007%3A38%3A37 we still need to do *something* about the speed of the new deadlock-hard test; this patch could avoid the need to dumb down or slow down that test even further.)

Not to be neglected also is that (I believe) this gives the right answer, whereas isolationtester's existing query is currently completely broken by parallel queries, and it doesn't understand non-conflicting lock modes either. (It did, at least partially, before commit 38f8bdcac4982215; I am not sure that taking out the mode checks was a good idea. But putting them back would make the query slower yet.)

This is WIP, in part because I've written no user docs for the new function, and in part because I think people might want to bikeshed the API. What is here is "pg_blocker_pids(integer) returns integer[]", which returns a possibly-empty array of the PIDs of backend processes that block the backend with the specified PID. You get an empty array, not an error, if the argument isn't a valid backend PID or that backend isn't waiting. In parallel query situations, the output includes PIDs that are blocking any PID in the given process's lock group, and what is reported is always the PID of the lock group leader for whichever process is doing the blocking. Also, in parallel query situations, the same PID might appear multiple times in the output; it didn't seem worth trying to eliminate duplicates.

Reasonable API change requests might include returning a rowset rather than an array and returning more data per lock conflict. I've not bothered with either thing here because isolationtester doesn't care, and it would make the query somewhat slower for isolationtester's usage (though, probably, not enough slower to really matter).

It should also be noted that I've not really tested the parallel query aspects of this, because I'm not sure how to create a conflicting lock request in a parallel worker. However, if I'm not still misunderstanding the new semantics of the lock data structures, that aspect of it seems unlikely to be wrong.

Comments?
regards, tom lane diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index 91218d0..97e8962 100644 *** a/src/backend/storage/ipc/procarray.c --- b/src/backend/storage/ipc/procarray.c *************** HaveVirtualXIDsDelayingChkpt(VirtualTran *** 2313,2318 **** --- 2313,2341 ---- PGPROC * BackendPidGetProc(int pid) { + PGPROC *result; + + if (pid == 0) /* never match dummy PGPROCs */ + return NULL; + + LWLockAcquire(ProcArrayLock, LW_SHARED); + + result = BackendPidGetProcWithLock(pid); + + LWLockRelease(ProcArrayLock); + + return result; + } + + /* + * BackendPidGetProcWithLock -- get a backend's PGPROC given its PID + * + * Same as above, except caller must be holding ProcArrayLock. The found + * entry, if any, can be assumed to be valid as long as the lock remains held. + */ + PGPROC * + BackendPidGetProcWithLock(int pid) + { PGPROC *result = NULL; ProcArrayStruct *arrayP = procArray; int index; *************** BackendPidGetProc(int pid) *** 2320,2327 **** if (pid == 0) /* never match dummy PGPROCs */ return NULL; - LWLockAcquire(ProcArrayLock, LW_SHARED); - for (index = 0; index < arrayP->numProcs; index++) { PGPROC *proc = &allProcs[arrayP->pgprocnos[index]]; --- 2343,2348 ---- *************** BackendPidGetProc(int pid) *** 2333,2340 **** } } - LWLockRelease(ProcArrayLock); - return result; } --- 2354,2359 ---- diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index fef59a2..fb32769 100644 *** a/src/backend/storage/lmgr/lock.c --- b/src/backend/storage/lmgr/lock.c *************** *** 21,27 **** * * Interface: * ! * InitLocks(), GetLocksMethodTable(), * LockAcquire(), LockRelease(), LockReleaseAll(), * LockCheckConflicts(), GrantLock() * --- 21,27 ---- * * Interface: * ! * InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(), * LockAcquire(), LockRelease(), LockReleaseAll(), * LockCheckConflicts(), GrantLock() * *************** *** 41,46 **** --- 41,47 ---- #include "pg_trace.h" #include "pgstat.h" #include "storage/proc.h" + #include "storage/procarray.h" #include "storage/sinvaladt.h" #include "storage/spin.h" #include "storage/standby.h" *************** static void CleanUpLock(LOCK *lock, PROC *** 356,361 **** --- 357,364 ---- static void LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc, LOCKTAG *locktag, LOCKMODE lockmode, bool decrement_strong_lock_count); + static void GetSingleProcBlockerStatusData(PGPROC *blocked_proc, + BlockedProcsData *data); /* *************** GetLocksMethodTable(const LOCK *lock) *** 462,467 **** --- 465,482 ---- return LockMethods[lockmethodid]; } + /* + * Fetch the lock method table associated with a given locktag + */ + LockMethod + GetLockTagsMethodTable(const LOCKTAG *locktag) + { + LOCKMETHODID lockmethodid = (LOCKMETHODID) locktag->locktag_lockmethodid; + + Assert(0 < lockmethodid && lockmethodid < lengthof(LockMethods)); + return LockMethods[lockmethodid]; + } + /* * Compute the hash code associated with a LOCKTAG. *************** GetLockStatusData(void) *** 3406,3412 **** * impractical (in particular, note MAX_SIMUL_LWLOCKS). It shouldn't * matter too much, because none of these locks can be involved in lock * conflicts anyway - anything that might must be present in the main lock ! * table. */ for (i = 0; i < ProcGlobal->allProcCount; ++i) { --- 3421,3430 ---- * impractical (in particular, note MAX_SIMUL_LWLOCKS). 
It shouldn't * matter too much, because none of these locks can be involved in lock * conflicts anyway - anything that might must be present in the main lock ! * table. (For the same reason, we don't sweat about making leaderPid ! * completely valid. We cannot safely dereference another backend's ! * lockGroupLeader field without holding all lock partition locks, and ! * it's not worth that.) */ for (i = 0; i < ProcGlobal->allProcCount; ++i) { *************** GetLockStatusData(void) *** 3439,3444 **** --- 3457,3463 ---- instance->backend = proc->backendId; instance->lxid = proc->lxid; instance->pid = proc->pid; + instance->leaderPid = proc->pid; instance->fastpath = true; el++; *************** GetLockStatusData(void) *** 3466,3471 **** --- 3485,3491 ---- instance->backend = proc->backendId; instance->lxid = proc->lxid; instance->pid = proc->pid; + instance->leaderPid = proc->pid; instance->fastpath = true; el++; *************** GetLockStatusData(void) *** 3517,3522 **** --- 3537,3543 ---- instance->backend = proc->backendId; instance->lxid = proc->lxid; instance->pid = proc->pid; + instance->leaderPid = proclock->groupLeader->pid; instance->fastpath = false; el++; *************** GetLockStatusData(void) *** 3538,3543 **** --- 3559,3754 ---- } /* + * GetBlockerStatusData - Return a summary of the lock manager's state + * concerning locks that are blocking the specified PID or any member of + * the PID's lock group, for use in a user-level reporting function. + * + * For each PID within the lock group that is awaiting some heavyweight lock, + * the return data includes an array of LockInstanceData objects, which are + * the same data structure used by GetLockStatusData; but unlike that function, + * this one reports only the PROCLOCKs associated with the lock that that pid + * is blocked on. (Hence, all the locktags should be the same for any one + * blocked PID.) In addition, we return an array of the PIDs of those backends + * that are ahead of the blocked PID in the lock's wait queue. These can be + * compared with the pids in the LockInstanceData objects to determine which + * waiters are ahead of or behind the blocked PID in the queue. + * + * If blocked_pid isn't a valid backend PID or nothing in its lock group is + * waiting on any heavyweight lock, return empty arrays. + * + * The design goal is to hold the LWLocks for as short a time as possible; + * thus, this function simply makes a copy of the necessary data and releases + * the locks, allowing the caller to contemplate and format the data for as + * long as it pleases. + */ + BlockedProcsData * + GetBlockerStatusData(int blocked_pid) + { + BlockedProcsData *data; + PGPROC *proc; + int i; + + data = (BlockedProcsData *) palloc(sizeof(BlockedProcsData)); + + /* + * Guess how much space we'll need, and preallocate. Most of the time + * this will avoid needing to do repalloc while holding the LWLocks. (We + * assume, but check with an Assert, that MaxBackends is enough entries + * for the procs[] array; the other two could need enlargement, though.) 
+ */ + data->nprocs = data->nlocks = data->npids = 0; + data->maxprocs = data->maxlocks = data->maxpids = MaxBackends; + data->procs = (BlockedProcData *) palloc(sizeof(BlockedProcData) * data->maxprocs); + data->locks = (LockInstanceData *) palloc(sizeof(LockInstanceData) * data->maxlocks); + data->waiter_pids = (int *) palloc(sizeof(int) * data->maxpids); + + /* + * In order to search the ProcArray for blocked_pid and assume that that + * entry won't immediately disappear under us, we must hold ProcArrayLock. + * In addition, to examine the lock grouping fields of any other backend, + * we must hold all the hash partition locks. (Only one of those locks is + * actually relevant for any one lock group, but we can't know which one + * ahead of time.) It's fairly annoying to hold all those locks + * throughout this, but it's no worse than GetLockStatusData(), and it + * does have the advantage that we're guaranteed to return a + * self-consistent instantaneous state. + */ + LWLockAcquire(ProcArrayLock, LW_SHARED); + + proc = BackendPidGetProcWithLock(blocked_pid); + + /* Nothing to do if it's gone */ + if (proc != NULL) + { + /* + * Acquire lock on the entire shared lock data structure. See notes + * in GetLockStatusData(). + */ + for (i = 0; i < NUM_LOCK_PARTITIONS; i++) + LWLockAcquire(LockHashPartitionLockByIndex(i), LW_SHARED); + + if (proc->lockGroupLeader == NULL) + { + /* Easy case, proc is not a lock group member */ + GetSingleProcBlockerStatusData(proc, data); + } + else + { + /* Examine all procs in proc's lock group */ + dlist_iter iter; + + dlist_foreach(iter, &proc->lockGroupLeader->lockGroupMembers) + { + PGPROC *memberProc; + + memberProc = dlist_container(PGPROC, lockGroupLink, iter.cur); + GetSingleProcBlockerStatusData(memberProc, data); + } + } + + /* + * And release locks. See notes in GetLockStatusData(). + */ + for (i = NUM_LOCK_PARTITIONS; --i >= 0;) + LWLockRelease(LockHashPartitionLockByIndex(i)); + + Assert(data->nprocs <= data->maxprocs); + } + + LWLockRelease(ProcArrayLock); + + return data; + } + + /* Accumulate data about one possibly-blocked proc for GetBlockerStatusData */ + static void + GetSingleProcBlockerStatusData(PGPROC *blocked_proc, BlockedProcsData *data) + { + LOCK *theLock = blocked_proc->waitLock; + BlockedProcData *bproc; + SHM_QUEUE *procLocks; + PROCLOCK *proclock; + PROC_QUEUE *waitQueue; + PGPROC *proc; + int queue_size; + int i; + + /* Nothing to do if this proc is not blocked */ + if (theLock == NULL) + return; + + /* Set up a procs[] element */ + bproc = &data->procs[data->nprocs++]; + bproc->pid = blocked_proc->pid; + bproc->first_lock = data->nlocks; + bproc->first_waiter = data->npids; + + /* + * We may ignore the proc's fast-path arrays, since nothing in those could + * be related to a contended lock. + */ + + /* Scan all PROCLOCKs associated with theLock. 
*/ + procLocks = &(theLock->procLocks); + proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks, + offsetof(PROCLOCK, lockLink)); + while (proclock) + { + PGPROC *proc = proclock->tag.myProc; + LOCK *lock = proclock->tag.myLock; + LockInstanceData *instance; + + if (data->nlocks >= data->maxlocks) + { + data->maxlocks += MaxBackends; + data->locks = (LockInstanceData *) + repalloc(data->locks, sizeof(LockInstanceData) * data->maxlocks); + } + + instance = &data->locks[data->nlocks]; + memcpy(&instance->locktag, &lock->tag, sizeof(LOCKTAG)); + instance->holdMask = proclock->holdMask; + if (proc->waitLock == proclock->tag.myLock) + instance->waitLockMode = proc->waitLockMode; + else + instance->waitLockMode = NoLock; + instance->backend = proc->backendId; + instance->lxid = proc->lxid; + instance->pid = proc->pid; + instance->leaderPid = proclock->groupLeader->pid; + instance->fastpath = false; + data->nlocks++; + + proclock = (PROCLOCK *) SHMQueueNext(procLocks, &proclock->lockLink, + offsetof(PROCLOCK, lockLink)); + } + + /* Now scan the lock's wait queue, stopping at blocked_proc */ + waitQueue = &(theLock->waitProcs); + queue_size = waitQueue->size; + + if (queue_size > data->maxpids - data->npids) + { + data->maxpids = Max(data->maxpids + MaxBackends, + data->npids + queue_size); + data->waiter_pids = (int *) repalloc(data->waiter_pids, + sizeof(int) * data->maxpids); + } + + proc = (PGPROC *) waitQueue->links.next; + for (i = 0; i < queue_size; i++) + { + if (proc == blocked_proc) + break; + data->waiter_pids[data->npids++] = proc->pid; + proc = (PGPROC *) proc->links.next; + } + + bproc->num_locks = data->nlocks - bproc->first_lock; + bproc->num_waiters = data->npids - bproc->first_waiter; + } + + /* * Returns a list of currently held AccessExclusiveLocks, for use by * LogStandbySnapshot(). The result is a palloc'd array, * with the number of elements returned into *nlocks. diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c index 73c78e9..ad9072d 100644 *** a/src/backend/utils/adt/lockfuncs.c --- b/src/backend/utils/adt/lockfuncs.c *************** *** 18,23 **** --- 18,24 ---- #include "funcapi.h" #include "miscadmin.h" #include "storage/predicate_internals.h" + #include "utils/array.h" #include "utils/builtins.h" *************** pg_lock_status(PG_FUNCTION_ARGS) *** 99,105 **** oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx); /* build tupdesc for result tuples */ ! /* this had better match pg_locks view in system_views.sql */ tupdesc = CreateTemplateTupleDesc(NUM_LOCK_STATUS_COLUMNS, false); TupleDescInitEntry(tupdesc, (AttrNumber) 1, "locktype", TEXTOID, -1, 0); --- 100,106 ---- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx); /* build tupdesc for result tuples */ ! /* this had better match function's declaration in pg_proc.h */ tupdesc = CreateTemplateTupleDesc(NUM_LOCK_STATUS_COLUMNS, false); TupleDescInitEntry(tupdesc, (AttrNumber) 1, "locktype", TEXTOID, -1, 0); *************** pg_lock_status(PG_FUNCTION_ARGS) *** 395,400 **** --- 396,511 ---- /* + * pg_blocker_pids - produce an array of the PIDs blocking given PID + * + * In parallel-query cases, we report all PIDs blocking any member of the + * given PID's lock group, and the reported PIDs are always those of lock + * group leaders. We do not bother eliminating duplicates from the result. + * + * We need not consider predicate locks here, since those don't block anything. 
+ */ + Datum + pg_blocker_pids(PG_FUNCTION_ARGS) + { + int blocked_pid = PG_GETARG_INT32(0); + Datum *arrayelems; + int narrayelems; + BlockedProcsData *lockData; /* state data from lmgr */ + int i, + j; + + /* Collect a snapshot of lock manager state */ + lockData = GetBlockerStatusData(blocked_pid); + + /* We can't need more output entries than there are reported PROCLOCKs */ + arrayelems = (Datum *) palloc(lockData->nlocks * sizeof(Datum)); + narrayelems = 0; + + /* For each blocked proc in the lock group ... */ + for (i = 0; i < lockData->nprocs; i++) + { + BlockedProcData *bproc = &lockData->procs[i]; + LockInstanceData *instances = &lockData->locks[bproc->first_lock]; + int *preceding_waiters = &lockData->waiter_pids[bproc->first_waiter]; + LockInstanceData *blocked_instance; + LockMethod lockMethodTable; + int conflictMask; + + /* + * Locate the blocked proc's own entry in the LockInstanceData array. + * There should be exactly one matching entry. + */ + blocked_instance = NULL; + for (j = 0; j < bproc->num_locks; j++) + { + LockInstanceData *instance = &(instances[j]); + + if (instance->pid == bproc->pid) + { + Assert(blocked_instance == NULL); + blocked_instance = instance; + } + } + Assert(blocked_instance != NULL); + + lockMethodTable = GetLockTagsMethodTable(&(blocked_instance->locktag)); + conflictMask = lockMethodTable->conflictTab[blocked_instance->waitLockMode]; + + /* Now scan the PROCLOCK data for conflicting procs */ + for (j = 0; j < bproc->num_locks; j++) + { + LockInstanceData *instance = &(instances[j]); + + /* A proc never blocks itself, so ignore that entry */ + if (instance == blocked_instance) + continue; + /* Members of same lock group never block each other, either */ + if (instance->leaderPid == blocked_instance->leaderPid) + continue; + + if (conflictMask & instance->holdMask) + /* blocked by lock already held by this entry */ ; + else if (instance->waitLockMode != NoLock && + (conflictMask & LOCKBIT_ON(instance->waitLockMode))) + { + /* conflict with lock requested by this entry; who's in front? 
*/ + bool ahead = false; + int k; + + for (k = 0; k < bproc->num_waiters; k++) + { + if (preceding_waiters[k] == instance->pid) + { + /* this entry is waiting ahead of blocked proc */ + ahead = true; + break; + } + } + if (!ahead) + continue; /* not blocked by this entry */ + } + else + { + /* not blocked by this entry */ + continue; + } + + /* blocked by this entry, so emit a record */ + arrayelems[narrayelems++] = Int32GetDatum(instance->leaderPid); + } + } + + /* Assert we didn't overrun arrayelems[] */ + Assert(narrayelems <= lockData->nlocks); + + /* Construct array, using hardwired knowledge about int4 type */ + PG_RETURN_ARRAYTYPE_P(construct_array(arrayelems, narrayelems, + INT4OID, + sizeof(int32), true, 'i')); + } + + + /* * Functions for manipulating advisory locks * * We make use of the locktag fields as follows: diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index 1c0ef9a..90a02b3 100644 *** a/src/include/catalog/pg_proc.h --- b/src/include/catalog/pg_proc.h *************** DATA(insert OID = 3329 ( pg_show_all_fi *** 3012,3017 **** --- 3012,3019 ---- DESCR("show config file settings"); DATA(insert OID = 1371 ( pg_lock_status PGNSP PGUID 12 1 1000 0 0 f f f f t t v s 0 0 2249 "" "{25,26,26,23,21,25,28,26,26,21,25,23,25,16,16}""{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{locktype,database,relation,page,tuple,virtualxid,transactionid,classid,objid,objsubid,virtualtransaction,pid,mode,granted,fastpath}" _null__null_ pg_lock_status _null_ _null_ _null_ )); DESCR("view system lock information"); + DATA(insert OID = 2561 ( pg_blocker_pids PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 1007 "23" _null_ _null_ _null_ _null__null_ pg_blocker_pids _null_ _null_ _null_ )); + DESCR("report PIDs of blockers of specified backend PID, as array"); DATA(insert OID = 1065 ( pg_prepared_xact PGNSP PGUID 12 1 1000 0 0 f f f f t t v s 0 0 2249 "" "{28,25,1184,26,26}" "{o,o,o,o,o}""{transaction,gid,prepared,ownerid,dbid}" _null_ _null_ pg_prepared_xact _null_ _null_ _null_ )); DESCR("view two-phase transactions"); DATA(insert OID = 3819 ( pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v s 1 0 2249 "28" "{28,28,25}""{i,o,o}" "{multixid,xid,mode}" _null_ _null_ pg_get_multixact_members _null_ _null_ _null_ )); diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h index 6b4e365..2ab5aa9 100644 *** a/src/include/storage/lock.h --- b/src/include/storage/lock.h *************** typedef struct PROCLOCK *** 346,352 **** PROCLOCKTAG tag; /* unique identifier of proclock object */ /* data */ ! PGPROC *groupLeader; /* group leader, or NULL if no lock group */ LOCKMASK holdMask; /* bitmask for lock types currently held */ LOCKMASK releaseMask; /* bitmask for lock types to be released */ SHM_QUEUE lockLink; /* list link in LOCK's list of proclocks */ --- 346,352 ---- PROCLOCKTAG tag; /* unique identifier of proclock object */ /* data */ ! PGPROC *groupLeader; /* proc's lock group leader, or proc itself */ LOCKMASK holdMask; /* bitmask for lock types currently held */ LOCKMASK releaseMask; /* bitmask for lock types to be released */ SHM_QUEUE lockLink; /* list link in LOCK's list of proclocks */ *************** typedef struct LOCALLOCK *** 423,443 **** typedef struct LockInstanceData { ! 
LOCKTAG locktag; /* locked object */ LOCKMASK holdMask; /* locks held by this PGPROC */ LOCKMODE waitLockMode; /* lock awaited by this PGPROC, if any */ BackendId backend; /* backend ID of this PGPROC */ LocalTransactionId lxid; /* local transaction ID of this PGPROC */ int pid; /* pid of this PGPROC */ bool fastpath; /* taken via fastpath? */ } LockInstanceData; typedef struct LockData { int nelements; /* The length of the array */ ! LockInstanceData *locks; } LockData; /* Result codes for LockAcquire() */ typedef enum --- 423,470 ---- typedef struct LockInstanceData { ! LOCKTAG locktag; /* tag for locked object */ LOCKMASK holdMask; /* locks held by this PGPROC */ LOCKMODE waitLockMode; /* lock awaited by this PGPROC, if any */ BackendId backend; /* backend ID of this PGPROC */ LocalTransactionId lxid; /* local transaction ID of this PGPROC */ int pid; /* pid of this PGPROC */ + int leaderPid; /* pid of group leader; = pid if no group */ bool fastpath; /* taken via fastpath? */ } LockInstanceData; typedef struct LockData { int nelements; /* The length of the array */ ! LockInstanceData *locks; /* Array of per-PROCLOCK information */ } LockData; + typedef struct BlockedProcData + { + int pid; /* pid of a blocked PGPROC */ + /* Per-PROCLOCK information about PROCLOCKs of the lock the pid awaits */ + /* (these fields refer to indexes in BlockedProcsData.locks[]) */ + int first_lock; /* index of first relevant LockInstanceData */ + int num_locks; /* number of relevant LockInstanceDatas */ + /* PIDs of PGPROCs that are ahead of "pid" in the lock's wait queue */ + /* (these fields refer to indexes in BlockedProcsData.waiter_pids[]) */ + int first_waiter; /* index of first preceding waiter */ + int num_waiters; /* number of preceding waiters */ + } BlockedProcData; + + typedef struct BlockedProcsData + { + BlockedProcData *procs; /* Array of per-blocked-proc information */ + LockInstanceData *locks; /* Array of per-PROCLOCK information */ + int *waiter_pids; /* Array of PIDs of other blocked PGPROCs */ + int nprocs; /* # of valid entries in procs[] array */ + int maxprocs; /* Allocated length of procs[] array */ + int nlocks; /* # of valid entries in locks[] array */ + int maxlocks; /* Allocated length of locks[] array */ + int npids; /* # of valid entries in waiter_pids[] array */ + int maxpids; /* Allocated length of waiter_pids[] array */ + } BlockedProcsData; + /* Result codes for LockAcquire() */ typedef enum *************** typedef enum *** 488,493 **** --- 515,521 ---- */ extern void InitLocks(void); extern LockMethod GetLocksMethodTable(const LOCK *lock); + extern LockMethod GetLockTagsMethodTable(const LOCKTAG *locktag); extern uint32 LockTagHashCode(const LOCKTAG *locktag); extern bool DoLockModesConflict(LOCKMODE mode1, LOCKMODE mode2); extern LockAcquireResult LockAcquire(const LOCKTAG *locktag, *************** extern void GrantAwaitedLock(void); *** 520,525 **** --- 548,554 ---- extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode); extern Size LockShmemSize(void); extern LockData *GetLockStatusData(void); + extern BlockedProcsData *GetBlockerStatusData(int blocked_pid); extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks); extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode); diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h index 1fbf4f3..dd37c0c 100644 *** a/src/include/storage/procarray.h --- b/src/include/storage/procarray.h *************** extern VirtualTransactionId *GetVirtualX *** 61,66 **** --- 61,67 
---- extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids); extern PGPROC *BackendPidGetProc(int pid); + extern PGPROC *BackendPidGetProcWithLock(int pid); extern int BackendXidGetPid(TransactionId xid); extern bool IsBackendPid(int pid); diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h index affcc01..4848cbd 100644 *** a/src/include/utils/builtins.h --- b/src/include/utils/builtins.h *************** extern Datum row_security_active_name(PG *** 1153,1158 **** --- 1153,1159 ---- /* lockfuncs.c */ extern Datum pg_lock_status(PG_FUNCTION_ARGS); + extern Datum pg_blocker_pids(PG_FUNCTION_ARGS); extern Datum pg_advisory_lock_int8(PG_FUNCTION_ARGS); extern Datum pg_advisory_xact_lock_int8(PG_FUNCTION_ARGS); extern Datum pg_advisory_lock_shared_int8(PG_FUNCTION_ARGS); diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c index 0a9d25c..a85d24e 100644 *** a/src/test/isolation/isolationtester.c --- b/src/test/isolation/isolationtester.c *************** main(int argc, char **argv) *** 227,253 **** */ initPQExpBuffer(&wait_query); appendPQExpBufferStr(&wait_query, ! "SELECT 1 FROM pg_locks holder, pg_locks waiter " ! "WHERE NOT waiter.granted AND waiter.pid = $1 " ! "AND holder.granted " ! "AND holder.pid <> $1 AND holder.pid IN ("); /* The spec syntax requires at least one session; assume that here. */ appendPQExpBufferStr(&wait_query, backend_pids[1]); for (i = 2; i < nconns; i++) ! appendPQExpBuffer(&wait_query, ", %s", backend_pids[i]); ! appendPQExpBufferStr(&wait_query, ! ") " ! ! "AND holder.locktype IS NOT DISTINCT FROM waiter.locktype " ! "AND holder.database IS NOT DISTINCT FROM waiter.database " ! "AND holder.relation IS NOT DISTINCT FROM waiter.relation " ! "AND holder.page IS NOT DISTINCT FROM waiter.page " ! "AND holder.tuple IS NOT DISTINCT FROM waiter.tuple " ! "AND holder.virtualxid IS NOT DISTINCT FROM waiter.virtualxid " ! "AND holder.transactionid IS NOT DISTINCT FROM waiter.transactionid " ! "AND holder.classid IS NOT DISTINCT FROM waiter.classid " ! "AND holder.objid IS NOT DISTINCT FROM waiter.objid " ! "AND holder.objsubid IS NOT DISTINCT FROM waiter.objsubid "); res = PQprepare(conns[0], PREP_WAITING, wait_query.data, 0, NULL); if (PQresultStatus(res) != PGRES_COMMAND_OK) --- 227,238 ---- */ initPQExpBuffer(&wait_query); appendPQExpBufferStr(&wait_query, ! "SELECT pg_catalog.pg_blocker_pids($1) && '{"); /* The spec syntax requires at least one session; assume that here. */ appendPQExpBufferStr(&wait_query, backend_pids[1]); for (i = 2; i < nconns; i++) ! appendPQExpBuffer(&wait_query, ",%s", backend_pids[i]); ! appendPQExpBufferStr(&wait_query, "}'::integer[]"); res = PQprepare(conns[0], PREP_WAITING, wait_query.data, 0, NULL); if (PQresultStatus(res) != PGRES_COMMAND_OK) *************** try_complete_step(Step *step, int flags) *** 745,765 **** /* If it's OK for the step to block, check whether it has. */ if (flags & STEP_NONBLOCK) { ! int ntuples; res = PQexecPrepared(conns[0], PREP_WAITING, 1, &backend_pids[step->session + 1], NULL, NULL, 0); ! if (PQresultStatus(res) != PGRES_TUPLES_OK) { fprintf(stderr, "lock wait query failed: %s", PQerrorMessage(conn)); exit_nicely(); } ! ntuples = PQntuples(res); PQclear(res); ! if (ntuples >= 1) /* waiting to acquire a lock */ { if (!(flags & STEP_RETRY)) printf("step %s: %s <waiting ...>\n", --- 730,751 ---- /* If it's OK for the step to block, check whether it has. */ if (flags & STEP_NONBLOCK) { ! 
bool waiting; res = PQexecPrepared(conns[0], PREP_WAITING, 1, &backend_pids[step->session + 1], NULL, NULL, 0); ! if (PQresultStatus(res) != PGRES_TUPLES_OK || ! PQntuples(res) != 1) { fprintf(stderr, "lock wait query failed: %s", PQerrorMessage(conn)); exit_nicely(); } ! waiting = ((PQgetvalue(res, 0, 0))[0] == 't'); PQclear(res); ! if (waiting) /* waiting to acquire a lock */ { if (!(flags & STEP_RETRY)) printf("step %s: %s <waiting ...>\n",
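[For reference, a sketch of how the draft function would be used, with Robert's PIDs from upthread; the exact output shown is assumed, not taken from a real run, and the function was later renamed pg_blocking_pids():]

    SELECT pg_blocker_pids(14000);
    -- {12000,13000}: both AccessShareLock holders block the
    -- AccessExclusiveLock waiter; an empty array means "not blocked"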
On Tue, Feb 16, 2016 at 2:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> I wonder if we shouldn't just expose a 'which pid is process X waiting
>> for' API, implemented serverside. That's generally really useful, and
>> looks like it's actually going to be less complicated than that
>> query... And it's surely going to be faster.
>
> Attached is a draft patch for a new function that reports the set of PIDs
> directly blocking a given PID (that is, holding or awaiting conflicting
> locks on the lockable object it's waiting for).
>
> I replaced isolationtester's pg_locks query with this, and found that
> it's about 9X faster in a normal build, and 3X faster with
> CLOBBER_CACHE_ALWAYS turned on. That would give us some nice headroom
> for the isolation tests with CLOBBER_CACHE_ALWAYS animals. (Note that
> in view of
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2016-02-14%2007%3A38%3A37
> we still need to do *something* about the speed of the new deadlock-hard
> test; this patch could avoid the need to dumb down or slow down that test
> even further.)
>
> Not to be neglected also is that (I believe) this gives the right answer,
> whereas isolationtester's existing query is currently completely broken by
> parallel queries, and it doesn't understand non-conflicting lock modes
> either. (It did, at least partially, before commit 38f8bdcac4982215;
> I am not sure that taking out the mode checks was a good idea. But
> putting them back would make the query slower yet.)

The reason I took that out is because it breaks the deadlock-soft test. It's possible to have a situation where no granted lock conflicts with an awaited lock. If that happens, the old query wrongly concluded that the waiting process was not in fact waiting. (Consider: A holds AccessShareLock, B awaits AccessExclusiveLock, C now requests AccessShareLock and *waits*.)

As for the patch itself, I'm having trouble grokking what it's trying to do. I think it might be worth having a comment defining precisely what we mean by "A blocks B". I would define "A blocks B" in general as either A holds a lock which conflicts with one sought by B (hard-blocked) or A awaits a lock which conflicts with one sought by B and precedes it in the wait queue (soft-blocked). I have wondered before if we shouldn't modify pg_locks to expose the wait-queue ordering; without that, you can't reliably determine in general whether A soft-blocks B, which means every view anyone has ever written over pg_locks that purports to say who blocks who is necessarily buggy.

For parallel queries, there's a further relevant distinction when we say "A blocks B". We might mean either that (1) process B cannot resume execution until the lock conflict is resolved or (2) the group leader for process B cannot complete the current parallel operation until the lock conflict is resolved. If you're trying to figure out why one particular member of a parallel group is stuck, you want to answer question #1. If you're trying to figure out all the things that need to get out of the way to finish the query, you want to answer question #2. I think this function is aiming to answer question #2, not question #1, but I'm less clear on the reason behind that definitional choice.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Feb 16, 2016 at 2:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Not to be neglected also is that (I believe) this gives the right answer,
>> whereas isolationtester's existing query is currently completely broken by
>> parallel queries, and it doesn't understand non-conflicting lock modes
>> either. (It did, at least partially, before commit 38f8bdcac4982215;
>> I am not sure that taking out the mode checks was a good idea. But
>> putting them back would make the query slower yet.)

> The reason I took that out is because it breaks the deadlock-soft
> test. It's possible to have a situation where no granted lock
> conflicts with an awaited lock. If that happens, the old query
> wrongly concluded that the waiting process was not in fact waiting.
> (Consider: A holds AccessShareLock, B awaits AccessExclusiveLock, C now
> requests AccessShareLock and *waits*.)

Ah, well, with this patch the deadlock-soft test still passes.

> As for the patch itself, I'm having trouble grokking what it's trying
> to do. I think it might be worth having a comment defining precisely
> what we mean by "A blocks B". I would define "A blocks B" in general
> as either A holds a lock which conflicts with one sought by B
> (hard-blocked) or A awaits a lock which conflicts with one sought by B
> and precedes it in the wait queue (soft-blocked).

Yes, that is exactly what I implemented ... and it's something you can't find out from pg_locks. I'm not sure how that view could be made to expose wait-queue ordering.

> For parallel queries, there's a further relevant distinction when we
> say "A blocks B". We might mean either that (1) process B cannot
> resume execution until the lock conflict is resolved or (2) the group
> leader for process B cannot complete the current parallel operation
> until the lock conflict is resolved.

The definition I used in this patch is "some member of A's lock group blocks some member of B's lock group", because that corresponds directly to whether A is preventing B's query from completing, which is what isolationtester wants to know --- and, I would argue, it's generally what any client would want to know. 99.9% of clients would just as soon not be aware of parallel workers lurking underneath the pg_backend_pid() values that they see for their sessions.

			regards, tom lane
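[To make the hard/soft distinction concrete, Robert's A/B/C scenario can be reproduced by hand in three psql sessions --- table name hypothetical:]

    -- session A: holds a conflicting lock => hard-blocks B
    BEGIN;
    SELECT * FROM foo;                        -- AccessShareLock, granted

    -- session B: its request conflicts with A's granted lock
    BEGIN;
    LOCK TABLE foo IN ACCESS EXCLUSIVE MODE;  -- queued behind A

    -- session C: no *granted* lock conflicts with its request, but B's
    -- queued AccessExclusiveLock does, and B is ahead of C in the wait
    -- queue => C is soft-blocked by B
    BEGIN;
    SELECT * FROM foo;                        -- AccessShareLock, waits

This is the case the old mode-checking query got wrong: it saw no granted conflicting lock for C and concluded C was not waiting at all.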
I just had a rather disturbing thought to the effect that this entire design --- ie, parallel workers taking out locks for themselves --- is fundamentally flawed. As far as I can tell from README.parallel, parallel workers are supposed to exit (and, presumably, release their locks) before the leader's transaction commits. Releasing locks before commit is wrong. Do I need to rehearse why?

			regards, tom lane
I wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> As for the patch itself, I'm having trouble grokking what it's trying >> to do. I think it might be worth having a comment defining precisely >> what we mean by "A blocks B". I would define "A blocks B" in general >> as either A holds a lock which conflicts with one sought by B >> (hard-blocked) or A awaits a lock which conflicts with one sought by B >> and precedes it in the wait queue (soft-blocked). > Yes, that is exactly what I implemented ... and it's something you can't > find out from pg_locks. I'm not sure how that view could be made to > expose wait-queue ordering. Here's an updated version of this patch, now with user-facing docs. I decided that "pg_blocking_pids()" is a better function name than "pg_blocker_pids()". The code's otherwise the same, although I revisited some of the comments. I also changed quite a few references to "transaction" into "process" in the discussion of pg_locks. The previous choice to conflate processes with transactions was never terribly wise in my view, and it's certainly completely broken by parallel query. regards, tom lane diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index d77e999..d3270e4 100644 *** a/doc/src/sgml/catalogs.sgml --- b/doc/src/sgml/catalogs.sgml *************** *** 8015,8030 **** <para> The view <structname>pg_locks</structname> provides access to ! information about the locks held by open transactions within the database server. See <xref linkend="mvcc"> for more discussion of locking. </para> <para> <structname>pg_locks</structname> contains one row per active lockable ! object, requested lock mode, and relevant transaction. Thus, the same lockable object might ! appear many times, if multiple transactions are holding or waiting for locks on it. However, an object that currently has no locks on it will not appear at all. </para> --- 8015,8030 ---- <para> The view <structname>pg_locks</structname> provides access to ! information about the locks held by active processes within the database server. See <xref linkend="mvcc"> for more discussion of locking. </para> <para> <structname>pg_locks</structname> contains one row per active lockable ! object, requested lock mode, and relevant process. Thus, the same lockable object might ! appear many times, if multiple processs are holding or waiting for locks on it. However, an object that currently has no locks on it will not appear at all. </para> *************** *** 8200,8210 **** <para> <structfield>granted</structfield> is true in a row representing a lock ! held by the indicated transaction. False indicates that this transaction is ! currently waiting to acquire this lock, which implies that some other ! transaction is holding a conflicting lock mode on the same lockable object. ! The waiting transaction will sleep until the other lock is released (or a ! deadlock situation is detected). A single transaction can be waiting to acquire at most one lock at a time. </para> --- 8200,8210 ---- <para> <structfield>granted</structfield> is true in a row representing a lock ! held by the indicated process. False indicates that this process is ! currently waiting to acquire this lock, which implies that at least one other ! process is holding a conflicting lock mode on the same lockable object. ! The waiting process will sleep until the other lock is released (or a ! deadlock situation is detected). A single process can be waiting to acquire at most one lock at a time. 
</para> *************** *** 8224,8230 **** Although tuples are a lockable type of object, information about row-level locks is stored on disk, not in memory, and therefore row-level locks normally do not appear in this view. ! If a transaction is waiting for a row-level lock, it will usually appear in the view as waiting for the permanent transaction ID of the current holder of that row lock. </para> --- 8224,8230 ---- Although tuples are a lockable type of object, information about row-level locks is stored on disk, not in memory, and therefore row-level locks normally do not appear in this view. ! If a process is waiting for a row-level lock, it will usually appear in the view as waiting for the permanent transaction ID of the current holder of that row lock. </para> *************** SELECT * FROM pg_locks pl LEFT JOIN pg_p *** 8281,8286 **** --- 8281,8300 ---- </para> <para> + While it is possible to obtain information about which processes block + which other processes by joining <structname>pg_locks</structname> against + itself, this is very difficult to get right in detail. Such a query would + have to encode knowledge about which lock modes conflict with which + others. Worse, the <structname>pg_locks</structname> view does not expose + information about which processes are ahead of which others in lock wait + queues, nor information about which processes are parallel workers running + on behalf of which other client sessions. It is better to use + the <function>pg_blocking_pids()</> function + (see <xref linkend="functions-info-session-table">) to identify which + process(es) a waiting process is blocked behind. + </para> + + <para> The <structname>pg_locks</structname> view displays data from both the regular lock manager and the predicate lock manager, which are separate systems; in addition, the regular lock manager subdivides its diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b001ce5..2d9fe7f 100644 *** a/doc/src/sgml/func.sgml --- b/doc/src/sgml/func.sgml *************** SELECT * FROM pg_ls_dir('.') WITH ORDINA *** 14997,15002 **** --- 14997,15008 ---- </row> <row> + <entry><literal><function>pg_blocking_pids(<type>int</type>)</function></literal></entry> + <entry><type>int[]</type></entry> + <entry>Process ID(s) that are blocking specified server process ID</entry> + </row> + + <row> <entry><literal><function>pg_conf_load_time()</function></literal></entry> <entry><type>timestamp with time zone</type></entry> <entry>configuration load time</entry> *************** SET search_path TO <replaceable>schema</ *** 15184,15189 **** --- 15190,15215 ---- </para> <indexterm> + <primary>pg_blocking_pids</primary> + </indexterm> + + <para> + <function>pg_blocking_pids</function> returns an array of the process IDs + of the sessions that are blocking the server process with the specified + process ID, or an empty array if there is no such server process or it is + not blocked. One server process blocks another if it either holds a lock + that conflicts with the blocked process's lock request (hard block), or is + waiting for a lock that would conflict with the blocked process's lock + request and is ahead of it in the wait queue (soft block). When using + parallel queries the result always lists client-visible process IDs (that + is, <function>pg_backend_pid</> results) even if the actual lock is held + or awaited by a child worker process. As a result of that, there may be + duplicated PIDs in the result. 
Also note that when a prepared transaction + holds a conflicting lock, it will be represented by a zero process ID in + the result of this function. + </para> + + <indexterm> <primary>pg_conf_load_time</primary> </indexterm> diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index 91218d0..97e8962 100644 *** a/src/backend/storage/ipc/procarray.c --- b/src/backend/storage/ipc/procarray.c *************** HaveVirtualXIDsDelayingChkpt(VirtualTran *** 2313,2318 **** --- 2313,2341 ---- PGPROC * BackendPidGetProc(int pid) { + PGPROC *result; + + if (pid == 0) /* never match dummy PGPROCs */ + return NULL; + + LWLockAcquire(ProcArrayLock, LW_SHARED); + + result = BackendPidGetProcWithLock(pid); + + LWLockRelease(ProcArrayLock); + + return result; + } + + /* + * BackendPidGetProcWithLock -- get a backend's PGPROC given its PID + * + * Same as above, except caller must be holding ProcArrayLock. The found + * entry, if any, can be assumed to be valid as long as the lock remains held. + */ + PGPROC * + BackendPidGetProcWithLock(int pid) + { PGPROC *result = NULL; ProcArrayStruct *arrayP = procArray; int index; *************** BackendPidGetProc(int pid) *** 2320,2327 **** if (pid == 0) /* never match dummy PGPROCs */ return NULL; - LWLockAcquire(ProcArrayLock, LW_SHARED); - for (index = 0; index < arrayP->numProcs; index++) { PGPROC *proc = &allProcs[arrayP->pgprocnos[index]]; --- 2343,2348 ---- *************** BackendPidGetProc(int pid) *** 2333,2340 **** } } - LWLockRelease(ProcArrayLock); - return result; } --- 2354,2359 ---- diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index fef59a2..9eb4dfb 100644 *** a/src/backend/storage/lmgr/lock.c --- b/src/backend/storage/lmgr/lock.c *************** *** 21,27 **** * * Interface: * ! * InitLocks(), GetLocksMethodTable(), * LockAcquire(), LockRelease(), LockReleaseAll(), * LockCheckConflicts(), GrantLock() * --- 21,27 ---- * * Interface: * ! * InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(), * LockAcquire(), LockRelease(), LockReleaseAll(), * LockCheckConflicts(), GrantLock() * *************** *** 41,46 **** --- 41,47 ---- #include "pg_trace.h" #include "pgstat.h" #include "storage/proc.h" + #include "storage/procarray.h" #include "storage/sinvaladt.h" #include "storage/spin.h" #include "storage/standby.h" *************** static void CleanUpLock(LOCK *lock, PROC *** 356,361 **** --- 357,364 ---- static void LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc, LOCKTAG *locktag, LOCKMODE lockmode, bool decrement_strong_lock_count); + static void GetSingleProcBlockerStatusData(PGPROC *blocked_proc, + BlockedProcsData *data); /* *************** GetLocksMethodTable(const LOCK *lock) *** 462,467 **** --- 465,482 ---- return LockMethods[lockmethodid]; } + /* + * Fetch the lock method table associated with a given locktag + */ + LockMethod + GetLockTagsMethodTable(const LOCKTAG *locktag) + { + LOCKMETHODID lockmethodid = (LOCKMETHODID) locktag->locktag_lockmethodid; + + Assert(0 < lockmethodid && lockmethodid < lengthof(LockMethods)); + return LockMethods[lockmethodid]; + } + /* * Compute the hash code associated with a LOCKTAG. *************** GetLockStatusData(void) *** 3406,3412 **** * impractical (in particular, note MAX_SIMUL_LWLOCKS). It shouldn't * matter too much, because none of these locks can be involved in lock * conflicts anyway - anything that might must be present in the main lock ! * table. 
*/ for (i = 0; i < ProcGlobal->allProcCount; ++i) { --- 3421,3430 ---- * impractical (in particular, note MAX_SIMUL_LWLOCKS). It shouldn't * matter too much, because none of these locks can be involved in lock * conflicts anyway - anything that might must be present in the main lock ! * table. (For the same reason, we don't sweat about making leaderPid ! * completely valid. We cannot safely dereference another backend's ! * lockGroupLeader field without holding all lock partition locks, and ! * it's not worth that.) */ for (i = 0; i < ProcGlobal->allProcCount; ++i) { *************** GetLockStatusData(void) *** 3439,3444 **** --- 3457,3463 ---- instance->backend = proc->backendId; instance->lxid = proc->lxid; instance->pid = proc->pid; + instance->leaderPid = proc->pid; instance->fastpath = true; el++; *************** GetLockStatusData(void) *** 3466,3471 **** --- 3485,3491 ---- instance->backend = proc->backendId; instance->lxid = proc->lxid; instance->pid = proc->pid; + instance->leaderPid = proc->pid; instance->fastpath = true; el++; *************** GetLockStatusData(void) *** 3517,3522 **** --- 3537,3543 ---- instance->backend = proc->backendId; instance->lxid = proc->lxid; instance->pid = proc->pid; + instance->leaderPid = proclock->groupLeader->pid; instance->fastpath = false; el++; *************** GetLockStatusData(void) *** 3538,3543 **** --- 3559,3755 ---- } /* + * GetBlockerStatusData - Return a summary of the lock manager's state + * concerning locks that are blocking the specified PID or any member of + * the PID's lock group, for use in a user-level reporting function. + * + * For each PID within the lock group that is awaiting some heavyweight lock, + * the return data includes an array of LockInstanceData objects, which are + * the same data structure used by GetLockStatusData; but unlike that function, + * this one reports only the PROCLOCKs associated with the lock that that pid + * is blocked on. (Hence, all the locktags should be the same for any one + * blocked PID.) In addition, we return an array of the PIDs of those backends + * that are ahead of the blocked PID in the lock's wait queue. These can be + * compared with the pids in the LockInstanceData objects to determine which + * waiters are ahead of or behind the blocked PID in the queue. + * + * If blocked_pid isn't a valid backend PID or nothing in its lock group is + * waiting on any heavyweight lock, return empty arrays. + * + * The design goal is to hold the LWLocks for as short a time as possible; + * thus, this function simply makes a copy of the necessary data and releases + * the locks, allowing the caller to contemplate and format the data for as + * long as it pleases. + */ + BlockedProcsData * + GetBlockerStatusData(int blocked_pid) + { + BlockedProcsData *data; + PGPROC *proc; + int i; + + data = (BlockedProcsData *) palloc(sizeof(BlockedProcsData)); + + /* + * Guess how much space we'll need, and preallocate. Most of the time + * this will avoid needing to do repalloc while holding the LWLocks. (We + * assume, but check with an Assert, that MaxBackends is enough entries + * for the procs[] array; the other two could need enlargement, though.) 
+ */ + data->nprocs = data->nlocks = data->npids = 0; + data->maxprocs = data->maxlocks = data->maxpids = MaxBackends; + data->procs = (BlockedProcData *) palloc(sizeof(BlockedProcData) * data->maxprocs); + data->locks = (LockInstanceData *) palloc(sizeof(LockInstanceData) * data->maxlocks); + data->waiter_pids = (int *) palloc(sizeof(int) * data->maxpids); + + /* + * In order to search the ProcArray for blocked_pid and assume that that + * entry won't immediately disappear under us, we must hold ProcArrayLock. + * In addition, to examine the lock grouping fields of any other backend, + * we must hold all the hash partition locks. (Only one of those locks is + * actually relevant for any one lock group, but we can't know which one + * ahead of time.) It's fairly annoying to hold all those locks + * throughout this, but it's no worse than GetLockStatusData(), and it + * does have the advantage that we're guaranteed to return a + * self-consistent instantaneous state. + */ + LWLockAcquire(ProcArrayLock, LW_SHARED); + + proc = BackendPidGetProcWithLock(blocked_pid); + + /* Nothing to do if it's gone */ + if (proc != NULL) + { + /* + * Acquire lock on the entire shared lock data structure. See notes + * in GetLockStatusData(). + */ + for (i = 0; i < NUM_LOCK_PARTITIONS; i++) + LWLockAcquire(LockHashPartitionLockByIndex(i), LW_SHARED); + + if (proc->lockGroupLeader == NULL) + { + /* Easy case, proc is not a lock group member */ + GetSingleProcBlockerStatusData(proc, data); + } + else + { + /* Examine all procs in proc's lock group */ + dlist_iter iter; + + dlist_foreach(iter, &proc->lockGroupLeader->lockGroupMembers) + { + PGPROC *memberProc; + + memberProc = dlist_container(PGPROC, lockGroupLink, iter.cur); + GetSingleProcBlockerStatusData(memberProc, data); + } + } + + /* + * And release locks. See notes in GetLockStatusData(). + */ + for (i = NUM_LOCK_PARTITIONS; --i >= 0;) + LWLockRelease(LockHashPartitionLockByIndex(i)); + + Assert(data->nprocs <= data->maxprocs); + } + + LWLockRelease(ProcArrayLock); + + return data; + } + + /* Accumulate data about one possibly-blocked proc for GetBlockerStatusData */ + static void + GetSingleProcBlockerStatusData(PGPROC *blocked_proc, BlockedProcsData *data) + { + LOCK *theLock = blocked_proc->waitLock; + BlockedProcData *bproc; + SHM_QUEUE *procLocks; + PROCLOCK *proclock; + PROC_QUEUE *waitQueue; + PGPROC *proc; + int queue_size; + int i; + + /* Nothing to do if this proc is not blocked */ + if (theLock == NULL) + return; + + /* Set up a procs[] element */ + bproc = &data->procs[data->nprocs++]; + bproc->pid = blocked_proc->pid; + bproc->first_lock = data->nlocks; + bproc->first_waiter = data->npids; + + /* + * We may ignore the proc's fast-path arrays, since nothing in those could + * be related to a contended lock. 
+ */ + + /* Collect all PROCLOCKs associated with theLock */ + procLocks = &(theLock->procLocks); + proclock = (PROCLOCK *) SHMQueueNext(procLocks, procLocks, + offsetof(PROCLOCK, lockLink)); + while (proclock) + { + PGPROC *proc = proclock->tag.myProc; + LOCK *lock = proclock->tag.myLock; + LockInstanceData *instance; + + if (data->nlocks >= data->maxlocks) + { + data->maxlocks += MaxBackends; + data->locks = (LockInstanceData *) + repalloc(data->locks, sizeof(LockInstanceData) * data->maxlocks); + } + + instance = &data->locks[data->nlocks]; + memcpy(&instance->locktag, &lock->tag, sizeof(LOCKTAG)); + instance->holdMask = proclock->holdMask; + if (proc->waitLock == lock) + instance->waitLockMode = proc->waitLockMode; + else + instance->waitLockMode = NoLock; + instance->backend = proc->backendId; + instance->lxid = proc->lxid; + instance->pid = proc->pid; + instance->leaderPid = proclock->groupLeader->pid; + instance->fastpath = false; + data->nlocks++; + + proclock = (PROCLOCK *) SHMQueueNext(procLocks, &proclock->lockLink, + offsetof(PROCLOCK, lockLink)); + } + + /* Enlarge waiter_pids[] if it's too small to hold all wait queue PIDs */ + waitQueue = &(theLock->waitProcs); + queue_size = waitQueue->size; + + if (queue_size > data->maxpids - data->npids) + { + data->maxpids = Max(data->maxpids + MaxBackends, + data->npids + queue_size); + data->waiter_pids = (int *) repalloc(data->waiter_pids, + sizeof(int) * data->maxpids); + } + + /* Collect PIDs from the lock's wait queue, stopping at blocked_proc */ + proc = (PGPROC *) waitQueue->links.next; + for (i = 0; i < queue_size; i++) + { + if (proc == blocked_proc) + break; + data->waiter_pids[data->npids++] = proc->pid; + proc = (PGPROC *) proc->links.next; + } + + bproc->num_locks = data->nlocks - bproc->first_lock; + bproc->num_waiters = data->npids - bproc->first_waiter; + } + + /* * Returns a list of currently held AccessExclusiveLocks, for use by * LogStandbySnapshot(). The result is a palloc'd array, * with the number of elements returned into *nlocks. diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c index 73c78e9..6bcab81 100644 *** a/src/backend/utils/adt/lockfuncs.c --- b/src/backend/utils/adt/lockfuncs.c *************** *** 18,23 **** --- 18,24 ---- #include "funcapi.h" #include "miscadmin.h" #include "storage/predicate_internals.h" + #include "utils/array.h" #include "utils/builtins.h" *************** pg_lock_status(PG_FUNCTION_ARGS) *** 99,105 **** oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx); /* build tupdesc for result tuples */ ! /* this had better match pg_locks view in system_views.sql */ tupdesc = CreateTemplateTupleDesc(NUM_LOCK_STATUS_COLUMNS, false); TupleDescInitEntry(tupdesc, (AttrNumber) 1, "locktype", TEXTOID, -1, 0); --- 100,106 ---- oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx); /* build tupdesc for result tuples */ ! /* this had better match function's declaration in pg_proc.h */ tupdesc = CreateTemplateTupleDesc(NUM_LOCK_STATUS_COLUMNS, false); TupleDescInitEntry(tupdesc, (AttrNumber) 1, "locktype", TEXTOID, -1, 0); *************** pg_lock_status(PG_FUNCTION_ARGS) *** 395,400 **** --- 396,523 ---- /* + * pg_blocking_pids - produce an array of the PIDs blocking given PID + * + * The reported PIDs are those that hold a lock conflicting with blocked_pid's + * current request (hard block), or are requesting such a lock and are ahead + * of blocked_pid in the lock's wait queue (soft block). 
+ * + * In parallel-query cases, we report all PIDs blocking any member of the + * given PID's lock group, and the reported PIDs are those of the blocking + * PIDs' lock group leaders. This allows callers to compare the result to + * lists of clients' pg_backend_pid() results even during a parallel query. + * + * Parallel query makes it possible for there to be duplicate PIDs in the + * result (either because multiple waiters are blocked by same PID, or + * because multiple blockers have same group leader PID). We do not bother + * to eliminate such duplicates from the result. + * + * We need not consider predicate locks here, since those don't block anything. + */ + Datum + pg_blocking_pids(PG_FUNCTION_ARGS) + { + int blocked_pid = PG_GETARG_INT32(0); + Datum *arrayelems; + int narrayelems; + BlockedProcsData *lockData; /* state data from lmgr */ + int i, + j; + + /* Collect a snapshot of lock manager state */ + lockData = GetBlockerStatusData(blocked_pid); + + /* We can't need more output entries than there are reported PROCLOCKs */ + arrayelems = (Datum *) palloc(lockData->nlocks * sizeof(Datum)); + narrayelems = 0; + + /* For each blocked proc in the lock group ... */ + for (i = 0; i < lockData->nprocs; i++) + { + BlockedProcData *bproc = &lockData->procs[i]; + LockInstanceData *instances = &lockData->locks[bproc->first_lock]; + int *preceding_waiters = &lockData->waiter_pids[bproc->first_waiter]; + LockInstanceData *blocked_instance; + LockMethod lockMethodTable; + int conflictMask; + + /* + * Locate the blocked proc's own entry in the LockInstanceData array. + * There should be exactly one matching entry. + */ + blocked_instance = NULL; + for (j = 0; j < bproc->num_locks; j++) + { + LockInstanceData *instance = &(instances[j]); + + if (instance->pid == bproc->pid) + { + Assert(blocked_instance == NULL); + blocked_instance = instance; + } + } + Assert(blocked_instance != NULL); + + lockMethodTable = GetLockTagsMethodTable(&(blocked_instance->locktag)); + conflictMask = lockMethodTable->conflictTab[blocked_instance->waitLockMode]; + + /* Now scan the PROCLOCK data for conflicting procs */ + for (j = 0; j < bproc->num_locks; j++) + { + LockInstanceData *instance = &(instances[j]); + + /* A proc never blocks itself, so ignore that entry */ + if (instance == blocked_instance) + continue; + /* Members of same lock group never block each other, either */ + if (instance->leaderPid == blocked_instance->leaderPid) + continue; + + if (conflictMask & instance->holdMask) + { + /* hard block: blocked by lock already held by this entry */ + } + else if (instance->waitLockMode != NoLock && + (conflictMask & LOCKBIT_ON(instance->waitLockMode))) + { + /* conflict in lock requests; who's in front in wait queue? 
*/ + bool ahead = false; + int k; + + for (k = 0; k < bproc->num_waiters; k++) + { + if (preceding_waiters[k] == instance->pid) + { + /* soft block: this entry is ahead of blocked proc */ + ahead = true; + break; + } + } + if (!ahead) + continue; /* not blocked by this entry */ + } + else + { + /* not blocked by this entry */ + continue; + } + + /* blocked by this entry, so emit a record */ + arrayelems[narrayelems++] = Int32GetDatum(instance->leaderPid); + } + } + + /* Assert we didn't overrun arrayelems[] */ + Assert(narrayelems <= lockData->nlocks); + + /* Construct array, using hardwired knowledge about int4 type */ + PG_RETURN_ARRAYTYPE_P(construct_array(arrayelems, narrayelems, + INT4OID, + sizeof(int32), true, 'i')); + } + + + /* * Functions for manipulating advisory locks * * We make use of the locktag fields as follows: diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index 59c50d9..62b9125 100644 *** a/src/include/catalog/pg_proc.h --- b/src/include/catalog/pg_proc.h *************** DATA(insert OID = 3329 ( pg_show_all_fi *** 3012,3017 **** --- 3012,3019 ---- DESCR("show config file settings"); DATA(insert OID = 1371 ( pg_lock_status PGNSP PGUID 12 1 1000 0 0 f f f f t t v s 0 0 2249 "" "{25,26,26,23,21,25,28,26,26,21,25,23,25,16,16}" "{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{locktype,database,relation,page,tuple,virtualxid,transactionid,classid,objid,objsubid,virtualtransaction,pid,mode,granted,fastpath}" _null_ _null_ pg_lock_status _null_ _null_ _null_ )); DESCR("view system lock information"); + DATA(insert OID = 2561 ( pg_blocking_pids PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 1007 "23" _null_ _null_ _null_ _null_ _null_ pg_blocking_pids _null_ _null_ _null_ )); + DESCR("get array of PIDs of sessions blocking specified backend PID"); DATA(insert OID = 1065 ( pg_prepared_xact PGNSP PGUID 12 1 1000 0 0 f f f f t t v s 0 0 2249 "" "{28,25,1184,26,26}" "{o,o,o,o,o}" "{transaction,gid,prepared,ownerid,dbid}" _null_ _null_ pg_prepared_xact _null_ _null_ _null_ )); DESCR("view two-phase transactions"); DATA(insert OID = 3819 ( pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v s 1 0 2249 "28" "{28,28,25}" "{i,o,o}" "{multixid,xid,mode}" _null_ _null_ pg_get_multixact_members _null_ _null_ _null_ )); diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h index 703eaf2..788d50a 100644 *** a/src/include/storage/lock.h --- b/src/include/storage/lock.h *************** typedef struct PROCLOCK *** 346,352 **** PROCLOCKTAG tag; /* unique identifier of proclock object */ /* data */ ! PGPROC *groupLeader; /* group leader, or NULL if no lock group */ LOCKMASK holdMask; /* bitmask for lock types currently held */ LOCKMASK releaseMask; /* bitmask for lock types to be released */ SHM_QUEUE lockLink; /* list link in LOCK's list of proclocks */ --- 346,352 ---- PROCLOCKTAG tag; /* unique identifier of proclock object */ /* data */ ! PGPROC *groupLeader; /* proc's lock group leader, or proc itself */ LOCKMASK holdMask; /* bitmask for lock types currently held */ LOCKMASK releaseMask; /* bitmask for lock types to be released */ SHM_QUEUE lockLink; /* list link in LOCK's list of proclocks */ *************** typedef struct LOCALLOCK *** 423,443 **** typedef struct LockInstanceData { !
LOCKTAG locktag; /* locked object */ LOCKMASK holdMask; /* locks held by this PGPROC */ LOCKMODE waitLockMode; /* lock awaited by this PGPROC, if any */ BackendId backend; /* backend ID of this PGPROC */ LocalTransactionId lxid; /* local transaction ID of this PGPROC */ int pid; /* pid of this PGPROC */ bool fastpath; /* taken via fastpath? */ } LockInstanceData; typedef struct LockData { int nelements; /* The length of the array */ ! LockInstanceData *locks; } LockData; /* Result codes for LockAcquire() */ typedef enum --- 423,470 ---- typedef struct LockInstanceData { ! LOCKTAG locktag; /* tag for locked object */ LOCKMASK holdMask; /* locks held by this PGPROC */ LOCKMODE waitLockMode; /* lock awaited by this PGPROC, if any */ BackendId backend; /* backend ID of this PGPROC */ LocalTransactionId lxid; /* local transaction ID of this PGPROC */ int pid; /* pid of this PGPROC */ + int leaderPid; /* pid of group leader; = pid if no group */ bool fastpath; /* taken via fastpath? */ } LockInstanceData; typedef struct LockData { int nelements; /* The length of the array */ ! LockInstanceData *locks; /* Array of per-PROCLOCK information */ } LockData; + typedef struct BlockedProcData + { + int pid; /* pid of a blocked PGPROC */ + /* Per-PROCLOCK information about PROCLOCKs of the lock the pid awaits */ + /* (these fields refer to indexes in BlockedProcsData.locks[]) */ + int first_lock; /* index of first relevant LockInstanceData */ + int num_locks; /* number of relevant LockInstanceDatas */ + /* PIDs of PGPROCs that are ahead of "pid" in the lock's wait queue */ + /* (these fields refer to indexes in BlockedProcsData.waiter_pids[]) */ + int first_waiter; /* index of first preceding waiter */ + int num_waiters; /* number of preceding waiters */ + } BlockedProcData; + + typedef struct BlockedProcsData + { + BlockedProcData *procs; /* Array of per-blocked-proc information */ + LockInstanceData *locks; /* Array of per-PROCLOCK information */ + int *waiter_pids; /* Array of PIDs of other blocked PGPROCs */ + int nprocs; /* # of valid entries in procs[] array */ + int maxprocs; /* Allocated length of procs[] array */ + int nlocks; /* # of valid entries in locks[] array */ + int maxlocks; /* Allocated length of locks[] array */ + int npids; /* # of valid entries in waiter_pids[] array */ + int maxpids; /* Allocated length of waiter_pids[] array */ + } BlockedProcsData; + /* Result codes for LockAcquire() */ typedef enum *************** typedef enum *** 489,494 **** --- 516,522 ---- */ extern void InitLocks(void); extern LockMethod GetLocksMethodTable(const LOCK *lock); + extern LockMethod GetLockTagsMethodTable(const LOCKTAG *locktag); extern uint32 LockTagHashCode(const LOCKTAG *locktag); extern bool DoLockModesConflict(LOCKMODE mode1, LOCKMODE mode2); extern LockAcquireResult LockAcquire(const LOCKTAG *locktag, *************** extern void GrantAwaitedLock(void); *** 521,526 **** --- 549,555 ---- extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode); extern Size LockShmemSize(void); extern LockData *GetLockStatusData(void); + extern BlockedProcsData *GetBlockerStatusData(int blocked_pid); extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks); extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode); diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h index 1fbf4f3..dd37c0c 100644 *** a/src/include/storage/procarray.h --- b/src/include/storage/procarray.h *************** extern VirtualTransactionId *GetVirtualX *** 61,66 **** --- 61,67 
---- extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids); extern PGPROC *BackendPidGetProc(int pid); + extern PGPROC *BackendPidGetProcWithLock(int pid); extern int BackendXidGetPid(TransactionId xid); extern bool IsBackendPid(int pid); diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h index 94c1881..7ec93c9 100644 *** a/src/include/utils/builtins.h --- b/src/include/utils/builtins.h *************** extern Datum row_security_active_name(PG *** 1157,1162 **** --- 1157,1163 ---- /* lockfuncs.c */ extern Datum pg_lock_status(PG_FUNCTION_ARGS); + extern Datum pg_blocking_pids(PG_FUNCTION_ARGS); extern Datum pg_advisory_lock_int8(PG_FUNCTION_ARGS); extern Datum pg_advisory_xact_lock_int8(PG_FUNCTION_ARGS); extern Datum pg_advisory_lock_shared_int8(PG_FUNCTION_ARGS); diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c index 0a9d25c..6461ae8 100644 *** a/src/test/isolation/isolationtester.c --- b/src/test/isolation/isolationtester.c *************** main(int argc, char **argv) *** 227,253 **** */ initPQExpBuffer(&wait_query); appendPQExpBufferStr(&wait_query, ! "SELECT 1 FROM pg_locks holder, pg_locks waiter " ! "WHERE NOT waiter.granted AND waiter.pid = $1 " ! "AND holder.granted " ! "AND holder.pid <> $1 AND holder.pid IN ("); /* The spec syntax requires at least one session; assume that here. */ appendPQExpBufferStr(&wait_query, backend_pids[1]); for (i = 2; i < nconns; i++) ! appendPQExpBuffer(&wait_query, ", %s", backend_pids[i]); ! appendPQExpBufferStr(&wait_query, ! ") " ! ! "AND holder.locktype IS NOT DISTINCT FROM waiter.locktype " ! "AND holder.database IS NOT DISTINCT FROM waiter.database " ! "AND holder.relation IS NOT DISTINCT FROM waiter.relation " ! "AND holder.page IS NOT DISTINCT FROM waiter.page " ! "AND holder.tuple IS NOT DISTINCT FROM waiter.tuple " ! "AND holder.virtualxid IS NOT DISTINCT FROM waiter.virtualxid " ! "AND holder.transactionid IS NOT DISTINCT FROM waiter.transactionid " ! "AND holder.classid IS NOT DISTINCT FROM waiter.classid " ! "AND holder.objid IS NOT DISTINCT FROM waiter.objid " ! "AND holder.objsubid IS NOT DISTINCT FROM waiter.objsubid "); res = PQprepare(conns[0], PREP_WAITING, wait_query.data, 0, NULL); if (PQresultStatus(res) != PGRES_COMMAND_OK) --- 227,238 ---- */ initPQExpBuffer(&wait_query); appendPQExpBufferStr(&wait_query, ! "SELECT pg_catalog.pg_blocking_pids($1) && '{"); /* The spec syntax requires at least one session; assume that here. */ appendPQExpBufferStr(&wait_query, backend_pids[1]); for (i = 2; i < nconns; i++) ! appendPQExpBuffer(&wait_query, ",%s", backend_pids[i]); ! appendPQExpBufferStr(&wait_query, "}'::integer[]"); res = PQprepare(conns[0], PREP_WAITING, wait_query.data, 0, NULL); if (PQresultStatus(res) != PGRES_COMMAND_OK) *************** try_complete_step(Step *step, int flags) *** 745,765 **** /* If it's OK for the step to block, check whether it has. */ if (flags & STEP_NONBLOCK) { ! int ntuples; res = PQexecPrepared(conns[0], PREP_WAITING, 1, &backend_pids[step->session + 1], NULL, NULL, 0); ! if (PQresultStatus(res) != PGRES_TUPLES_OK) { fprintf(stderr, "lock wait query failed: %s", PQerrorMessage(conn)); exit_nicely(); } ! ntuples = PQntuples(res); PQclear(res); ! if (ntuples >= 1) /* waiting to acquire a lock */ { if (!(flags & STEP_RETRY)) printf("step %s: %s <waiting ...>\n", --- 730,751 ---- /* If it's OK for the step to block, check whether it has. */ if (flags & STEP_NONBLOCK) { ! 
bool waiting; res = PQexecPrepared(conns[0], PREP_WAITING, 1, &backend_pids[step->session + 1], NULL, NULL, 0); ! if (PQresultStatus(res) != PGRES_TUPLES_OK || ! PQntuples(res) != 1) { fprintf(stderr, "lock wait query failed: %s", PQerrorMessage(conn)); exit_nicely(); } ! waiting = ((PQgetvalue(res, 0, 0))[0] == 't'); PQclear(res); ! if (waiting) /* waiting to acquire a lock */ { if (!(flags & STEP_RETRY)) printf("step %s: %s <waiting ...>\n",
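As a quick usage sketch (not part of the patch; it assumes only that the function behaves as documented above, and the pg_stat_activity columns are just one convenient pairing):

    SELECT pid,
           pg_blocking_pids(pid) AS blocked_by,
           query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0;

Each row is a waiting session together with the array of client-visible PIDs it is blocked behind; an empty result means nothing is currently blocked.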
On Wed, Feb 17, 2016 at 9:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I just had a rather disturbing thought to the effect that this entire > design --- ie, parallel workers taking out locks for themselves --- is > fundamentally flawed. As far as I can tell from README.parallel, > parallel workers are supposed to exit (and, presumably, release their > locks) before the leader's transaction commits. Releasing locks before > commit is wrong. Do I need to rehearse why? No, you don't. I've spent a good deal of time thinking about that problem. In typical cases, workers are going to be acquiring either catalog locks (which are released before commit) or locks on relations which the leader has already locked (in which case the leader will still hold the lock - or possibly a stronger one - even after the worker releases that lock). Suppose, however, that you write a function which goes and queries some other table not involved in the query, and therefore acquires a lock on it. If you mark that function PARALLEL SAFE and it runs only in the worker and not in the leader, then you could end up with a parallel query that releases the lock before commit where a non-parallel version of that query would have held the lock until transaction commit. Of course, one answer to this problem is - if the early lock release is apt to be a problem for you - don't mark such functions PARALLEL SAFE. I've thought about engineering a better solution. Two possible designs come to mind. First, we could have the worker send to the leader a list of locks that it holds at the end of its work, and the leader could acquire all of those before confirming to the worker that it is OK to terminate. That has some noteworthy disadvantages, like being prone to deadlock and requiring workers to stick around potentially quite a bit longer than they do at present, thus limiting the ability of other processes to use parallel query. Second, we could have the workers reassign all of their locks to the leader in the lock table (unless the leader already holds that lock). The problem with that is that then the leader is in the weird situation of having locks in the shared lock table that it doesn't know anything about - they don't appear in its local lock table. How does the leader decide which resource owner they go with? Unless I'm missing something, though, this is a fairly obscure problem. Early release of catalog locks is desirable, and locks on scanned tables should be the same locks (or weaker) than already held by the master. Other cases are rare. I think. It would be good to know if you think otherwise. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
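To make that hazard concrete, here is a sketch with invented object names (it assumes a plan in which f() happens to be evaluated only in the workers):

    CREATE TABLE other_tab (x int);

    CREATE FUNCTION f() RETURNS bigint
    LANGUAGE sql PARALLEL SAFE
    AS $$ SELECT count(*) FROM other_tab $$;  -- acquires AccessShareLock on other_tab

    BEGIN;
    SELECT sum(f()) FROM big_tab;  -- in a parallel plan, the workers' lock on
                                   -- other_tab is released when they exit; a
                                   -- non-parallel plan would hold it until COMMIT
    COMMIT;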
On Mon, Feb 22, 2016 at 2:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> As for the patch itself, I'm having trouble grokking what it's trying >>> to do. I think it might be worth having a comment defining precisely >>> what we mean by "A blocks B". I would define "A blocks B" in general >>> as either A holds a lock which conflicts with one sought by B >>> (hard-blocked) or A awaits a lock which conflicts with one sought by B >>> and precedes it in the wait queue (soft-blocked). > >> Yes, that is exactly what I implemented ... and it's something you can't >> find out from pg_locks. I'm not sure how that view could be made to >> expose wait-queue ordering. > > Here's an updated version of this patch, now with user-facing docs. > > I decided that "pg_blocking_pids()" is a better function name than > "pg_blocker_pids()". The code's otherwise the same, although I > revisited some of the comments. > > I also changed quite a few references to "transaction" into "process" > in the discussion of pg_locks. The previous choice to conflate > processes with transactions was never terribly wise in my view, and > it's certainly completely broken by parallel query. ! held by the indicated process. False indicates that this process is ! currently waiting to acquire this lock, which implies that at least one other ! process is holding a conflicting lock mode on the same lockable object. I know you're just updating existing language here, but this is false. It only implies that one other process is holding *or waiting for* a conflicting lock mode on the same lockable object. Other than that, I think the documentation changes look good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
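To spell out the case where that matters, a minimal three-session sketch (table name invented):

    -- session 1
    BEGIN;
    SELECT * FROM t;                        -- holds AccessShareLock on t

    -- session 2
    LOCK TABLE t IN ACCESS EXCLUSIVE MODE;  -- conflicts with session 1's lock, waits

    -- session 3
    SELECT * FROM t;                        -- granted = false in pg_locks, although
                                            -- no one *holds* a conflicting lock:
                                            -- session 3 is soft-blocked by session 2,
                                            -- which is merely waiting for one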
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Feb 17, 2016 at 9:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I just had a rather disturbing thought to the effect that this entire >> design --- ie, parallel workers taking out locks for themselves --- is >> fundamentally flawed. As far as I can tell from README.parallel, >> parallel workers are supposed to exit (and, presumably, release their >> locks) before the leader's transaction commits. Releasing locks before >> commit is wrong. Do I need to rehearse why? > No, you don't. I've spent a good deal of time thinking about that problem. > [ much snipped ] > Unless I'm missing something, though, this is a fairly obscure > problem. Early release of catalog locks is desirable, and locks on > scanned tables should be the same locks (or weaker) than already held > by the master. Other cases are rare. I think. It would be good to > know if you think otherwise. After further thought, what I think about this is that it's safe so long as parallel workers are strictly read-only. Given that, early lock release after user table access is okay for the same reasons it's okay after catalog accesses. However, this is one of the big problems that we'd have to have a solution for before we ever consider allowing read-write parallelism. So what distresses me about the current situation is that this is a large stumbling block that I don't see documented anywhere. It'd be good if you transcribed some of this conversation into README.parallel. (BTW, I don't believe your statement that transferring locks back to the master would be deadlock-prone. If the lock system treats locks held by a lock group as effectively all belonging to one entity, then granting a lock identical to one already held by another group member should never fail. I concur that it might be expensive performance-wise, though it hardly seems like this would be a dominant factor compared to all the other costs of setting up and tearing down parallel workers.) regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Feb 22, 2016 at 2:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > ! held by the indicated process. False indicates that this process is > ! currently waiting to acquire this lock, which implies that at > least one other > ! process is holding a conflicting lock mode on the same lockable object. > I know you're just updating existing language here, but this is false. > It only implies that one other process is holding *or waiting for* a > conflicting lock mode on the same lockable object. True. I had considered whether to fix that point as well, and decided that it might just be overcomplicating matters. But since you complain, I'll add "or waiting for". It also occurred to me last night that pg_blocking_pids() needs a disclaimer similar to the existing one for pg_locks about how using it a lot could put a performance drag on the system. Other than adjusting those points, I think this is ready to go, and will commit later today if I hear no objections. regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Wed, Feb 17, 2016 at 9:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I just had a rather disturbing thought to the effect that this entire > >> design --- ie, parallel workers taking out locks for themselves --- is > >> fundamentally flawed. As far as I can tell from README.parallel, > >> parallel workers are supposed to exit (and, presumably, release their > >> locks) before the leader's transaction commits. Releasing locks before > >> commit is wrong. Do I need to rehearse why? > > > No, you don't. I've spent a good deal of time thinking about that problem. > > [ much snipped ] > > Unless I'm missing something, though, this is a fairly obscure > > problem. Early release of catalog locks is desirable, and locks on > > scanned tables should be the same locks (or weaker) than already held > > by the master. Other cases are rare. I think. It would be good to > > know if you think otherwise. > > After further thought, what I think about this is that it's safe so long > as parallel workers are strictly read-only. Given that, early lock > release after user table access is okay for the same reasons it's okay > after catalog accesses. However, this is one of the big problems that > we'd have to have a solution for before we ever consider allowing > read-write parallelism. Having such a blocker for read-write parallelism would be unfortunate, though perhaps there isn't much help for it, and having read-only query parallelism is certainly far better than nothing. > So what distresses me about the current situation is that this is a large > stumbling block that I don't see documented anywhere. It'd be good if you > transcribed some of this conversation into README.parallel. Agreed. > (BTW, I don't believe your statement that transferring locks back to the > master would be deadlock-prone. If the lock system treats locks held by > a lock group as effectively all belonging to one entity, then granting a > lock identical to one already held by another group member should never > fail. I concur that it might be expensive performance-wise, though it > hardly seems like this would be a dominant factor compared to all the > other costs of setting up and tearing down parallel workers.) This is only when a parallel worker is finished, no? Isn't there already some indication of when a parallel worker is done that the master handles, where it could also check the shared lock table and see if any locks were transferred to it on worker exit? Only following this thread from afar, so take my suggestions with a grain of salt. Thanks! Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> ... However, this is one of the big problems that >> we'd have to have a solution for before we ever consider allowing >> read-write parallelism. > Having such a blocker for read-write parallelism would be unfortunate, > though perhaps there isn't much help for it, and having read-only query > parallelism is certainly far better than nothing. IIUC, this is not the only large problem standing between us and read-write parallelism, and probably not even the biggest one. So I'm basically just asking that it gets documented while it's fresh in mind, rather than leaving it for some future hackers to rediscover the hard way. (Wouldn't be bad to doc the other known stumbling blocks, too.) regards, tom lane
On Mon, Feb 22, 2016 at 7:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> No, you don't. I've spent a good deal of time thinking about that problem. >> [ much snipped ] >> Unless I'm missing something, though, this is a fairly obscure >> problem. Early release of catalog locks is desirable, and locks on >> scanned tables should be the same locks (or weaker) than already held >> by the master. Other cases are rare. I think. It would be good to >> know if you think otherwise. > > After further thought, what I think about this is that it's safe so long > as parallel workers are strictly read-only. Given that, early lock > release after user table access is okay for the same reasons it's okay > after catalog accesses. However, this is one of the big problems that > we'd have to have a solution for before we ever consider allowing > read-write parallelism. Actually, I don't quite see what read-only vs. read-write queries have to do with this particular issue. We retain relation locks on target relations until commit, regardless of whether those locks are AccessShareLock, RowShareLock, or RowExclusiveLock. As far as I understand it, this isn't because anything would fail horribly if we released those locks at end of query, but rather because we think that releasing those locks early might result in unpleasant surprises for client applications. I'm actually not really convinced that's true: I will grant that it might be surprising to run the same query twice in the same transaction and get different tuple descriptors, but it might also be surprising to get different rows, which READ COMMITTED allows anyway. And I've met a few users who were pretty surprised to find out that they couldn't do DDL on table A even though the blocking session mentioned table A nowhere in its currently-executing query. The main issues with allowing read-write parallelism that I know of off-hand are: * Updates or deletes might create new combo CIDs. In order to handle that, we'd need to store the combo CID mapping in some sort of DSM-based data structure which could expand as new combo CIDs were generated. * Relation extension locks, and a few other types of heavyweight locks, are used for mutual exclusion of operations that would need to be serialized even among cooperating backends. So group locking would need to be enhanced to handle those cases differently, or some other solution would need to be found. (I've done some more detailed analysis here about possible solutions, most of which has been posted to -hackers in various emails at one time or another; I'll refrain from diving into all the details in this missive.) But those are separate from the question of whether parallel workers need to transfer any heavyweight locks they accumulate on non-scanned tables back to the leader. > So what distresses me about the current situation is that this is a large > stumbling block that I don't see documented anywhere. It'd be good if you > transcribed some of this conversation into README.parallel. > > (BTW, I don't believe your statement that transferring locks back to the > master would be deadlock-prone. If the lock system treats locks held by > a lock group as effectively all belonging to one entity, then granting a > lock identical to one already held by another group member should never > fail. I concur that it might be expensive performance-wise, though it > hardly seems like this would be a dominant factor compared to all the > other costs of setting up and tearing down parallel workers.)
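A sketch of the kind of surprise I mean (invented table names):

    -- session 1
    BEGIN;
    SELECT * FROM a;                 -- AccessShareLock on a is retained until commit
    SELECT * FROM b;                 -- the currently-executing query mentions only b ...

    -- session 2
    ALTER TABLE a ADD COLUMN y int;  -- ... yet this blocks behind session 1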
I don't mean that the heavyweight lock acquisition itself would fail; I agree with your analysis on that. I mean that you'd have to design the protocol for the leader and the worker to communicate very carefully in order for it not to get stuck. Right now, the leader initializes the DSM at startup, before any workers are running, with all the data the workers will need, and after that data flows strictly from workers to leader. So the workers could send control messages indicating heavyweight locks that they held to the leader, and that would be fine. Then the leader would need to read those messages and do something with them, after which it would need to tell the workers that they could now exit. You'd need to make sure there was no situation in which that handshake could get stuck, for example because the leader was waiting for a tuple from the worker while the worker was waiting for a lock-acquisition confirmation from the leader. That particular thing is probably not an issue, but hopefully it illustrates the sort of hazard I'm concerned about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company