Discussion: Re: [COMMITTERS] pgsql: Introduce group locking to prevent parallel processes from deadlocking
On Sat, Feb 13, 2016 at 1:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <rhaas@postgresql.org> writes:
>> Introduce group locking to prevent parallel processes from deadlocking.
>
> I'm fairly unhappy about this patch, because it has introduced a large
> amount of new complexity into the lock manager with effectively no
> documentation. Yeah, there's several paragraphs added to lmgr/README,
> but they're just a handwavy apologia for the high-level decision to allow
> exclusive locks taken by parallel workers to not conflict. I see nothing
> whatever explaining how it works, that is any useful docs about the new
> data structures or invariants. For example, I can't figure out whether
> you have broken the pg_locks display (by causing relevant lock state to
> not get displayed, or get displayed in a useless fashion because one
> can no longer tell which entries block which other entries). I can't
> even tell for sure if it works at all: the README says only that locks
> taken by members of the same parallel group don't conflict, but the
> implementation looks more like some operations get applied to the group
> leader rather than to followers, which is something completely different.

OK. I don't know exactly what new documentation needs to be written, but I'm happy to try to work to improve it. Obviously, it was clear enough to me, but that's not saying a lot. Let me try to answer some of your questions here via email, and then we can decide what of that should go into comments, the lmgr README, or someplace else.

First, the overall concept here is that processes can either be a member of a lock group or a member of no lock group. The concept of a lock group is formally separate from the concept of a parallel group created by a ParallelContext, but it is not clear that there will ever be any other context in which a lock group will be a good idea.
It is not impossible to imagine: for example, suppose you had a group of backends that all had TCP client connections, and those processes all wanted to ingest data and stuff it in a table but without allowing any unrelated process to touch the table, say because it was going to be inconsistent during the operation and until indexes were afterwards rebuilt. I don't have any plans to implement anything like that, but I felt it was better to keep the concept of a lock group - which is a group of processes that cooperate so closely that their locks need not conflict - separate from the concept of a parallel context - which is a leader process that is most likely connected to a user plus a bunch of ephemeral background workers that aren't. That way, if somebody later wants to try to reuse the lock grouping stuff for something else, nothing will get in the way of that; if not, no harm done; but keeping the two things decoupled is at least easier to understand, IMHO.

Second, I can review the data structure changes and the associated invariants. Each PGPROC has four new members: lockGroupLeaderIdentifier, lockGroupLeader, lockGroupMembers, and lockGroupLink. The first is simply a safety mechanism. A newly started parallel worker has to try to join the leader's lock group, but it has no guarantee that the group leader is still alive by the time it gets started. I tried to ensure that the parallel leader dies after all workers in normal cases, but also that the system could survive relatively intact if that somehow fails to happen. This is one of the precautions against such a scenario: the leader relays its PGPROC and also its PID to the worker, and the worker fails to join the lock group unless the given PGPROC still has the same PID. I assume that PIDs are not recycled quickly enough for this interlock to fail.

lockGroupLeader is NULL for processes not involved in parallel query.
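The PID interlock described above can be sketched in a few lines of C. This is an illustrative sketch only, with stand-in types and names, not PostgreSQL's actual declarations (the real logic lives in BecomeLockGroupMember()):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the real PostgreSQL declarations. */
typedef struct PGPROC
{
    int            pid;              /* process ID, 0 if slot unused */
    struct PGPROC *lockGroupLeader;  /* NULL if not in a lock group */
} PGPROC;

/*
 * A worker is handed the leader's PGPROC pointer and PID at launch time.
 * If the leader has meanwhile exited (and its PGPROC slot possibly been
 * recycled for an unrelated process), the stored PID no longer matches,
 * and the join is refused rather than attaching to the wrong group.
 */
static int
become_lock_group_member(PGPROC *me, PGPROC *leader, int leader_pid)
{
    if (leader->pid != leader_pid)
        return 0;               /* leader died; do not join */
    me->lockGroupLeader = leader;
    return 1;
}
```

The sketch omits the partition locking that the real function does around the check; it only illustrates why relaying both the PGPROC and the PID gives the worker a way to detect a dead or recycled leader.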
When a process wants to cooperate with parallel workers, it becomes a lock group leader, which means setting this field to point to its own PGPROC. When a parallel worker starts up, it points this field at the leader, with the above-mentioned interlock. The lockGroupMembers field is only used in the leader; it is a list of the workers. The lockGroupLink field is used to link the leader and all workers into the leader's list.

All of these fields are protected by the lock manager locks; the lock manager lock that protects the fields in a given PGPROC is chosen by taking pgprocno modulo the number of lock manager partitions. This unusual arrangement has a major advantage: the deadlock detector can count on the fact that no lockGroupLeader field can change while the deadlock detector is running, because it knows that it holds all the lock manager locks. Earlier versions of the patch protected these fields with the PGPROC's backendLock, but it was very hard to be confident that this wouldn't lead to deadlock where one process acquires a lock manager lock and then a backendLock and another process does the reverse. Even if that didn't happen, it was inconvenient not to be able to access the fields without additional locking.

Each PROCLOCK now has a new groupLeader field. While a PGPROC's lockGroupLeader may be NULL if that process is not involved in group locking, a PROCLOCK's groupLeader always has a non-NULL value; it points back to the PGPROC that acquired the lock. There's no horribly principled reason for this different contract; it just turned out to be more convenient. PGPROCs need to distinguish whether they are involved in group locking or not, and using NULL as a sentinel value for lockGroupLeader is a convenient way to do that; but it turns out we don't really need to know that for a PROCLOCK, so setting groupLeader equal to myProc when there's no group locking involved is more convenient.
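The partition-selection rule above is simple enough to state in code. A minimal sketch, assuming the stock value of 16 lock manager partitions (as noted later in the thread, the pgprocno used is the group leader's, so all members of one group agree on a single partition lock):

```c
#include <assert.h>

/*
 * Sketch of the rule described above: the group-locking fields in a
 * PGPROC are protected by the lock manager partition lock selected from
 * the group leader's pgprocno. NUM_LOCK_PARTITIONS is 16 in stock
 * builds. Because every member uses the *leader's* pgprocno, a whole
 * lock group maps to one partition lock.
 */
#define NUM_LOCK_PARTITIONS 16

static int
lock_group_partition(int leader_pgprocno)
{
    return leader_pgprocno % NUM_LOCK_PARTITIONS;
}
```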
Within the deadlock detector, an EDGE now represents a constraint between two lock groups rather than between two processes; therefore, the waiter and blocker are the respective group leaders rather than the processes actually waiting. A LOCK * is added to the EDGE data structure for this reason -- different processes in the lock group might be waiting for different locks, and we need to be sure about which lock we're constraining.

It's possible for there to be multiple processes on the same wait queue for the same LOCK; in fact, this is probably the most common scenario. Imagine that the leader acquires a lock, probably AccessShareLock, but then before the workers start up, someone queues up for an AccessExclusiveLock. The workers will all be soft-blocked by that other process, but must go ahead of it; the leader won't release its lock until some time after the query finishes, and that won't happen until the workers finish. TopoSort() handles this by emitting all processes in the same lock group adjacent to each other in the output wait ordering. As the comment explains:

/*
 * Output everything in the lock group. There's no point in outputting
 * an ordering where members of the same lock group are not
 * consecutive on the wait queue: if some other waiter is between two
 * requests that belong to the same group, then either it conflicts
 * with both of them and is certainly not a solution; or it conflicts
 * with at most one of them and is thus isomorphic to an ordering
 * where the group members are consecutive.
 */

There are a number of other new comments in TopoSort() which I believe to be helpful for understanding how it goes about its mission.

With respect to pg_locks - and for that matter also pg_stat_activity - I think you are right that improvement is needed.
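The adjacency property the comment describes can be captured in a small checker. This is an illustrative sketch, not TopoSort() itself: it merely verifies that, in a proposed wait ordering, all waiters sharing a group leader appear consecutively.

```c
#include <assert.h>

/*
 * Given a proposed wait ordering, represented as the group-leader id of
 * each waiting process (a process outside any group counts as its own
 * leader), return 1 if every lock group's members are consecutive, and
 * 0 otherwise. Per the TopoSort() comment quoted above, orderings that
 * interleave group members with outside waiters need not be considered.
 */
static int
groups_are_consecutive(const int *leader, int n)
{
    for (int i = 1; i < n; i++)
    {
        if (leader[i] == leader[i - 1])
            continue;           /* still inside the same run */
        /* Start of a new run: this group must not have appeared before. */
        for (int j = 0; j < i - 1; j++)
            if (leader[j] == leader[i])
                return 0;       /* group split by an intervening waiter */
    }
    return 1;
}
```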
I really haven't done anything about those during the whole course of developing parallel query, so the information that they show for parallel workers is precisely the same information that they would show for any other background worker. Now, background workers do show up in both places, but it's not straightforward to identify which background workers go with which parallel leader or whether group locking is in use. The simplest thing we could do to make that easier is, in pg_stat_activity, have parallel workers advertise the PID that launched them in a new field; and in pg_locks, have members of a lock group advertise the leader's PID in a new field. I think that would make it pretty simple to figure out what you want to know. Although such an improvement is probably necessary, I have my doubts about whether it's entirely sufficient - I feel like there might be other improvements we should also make, but I think somebody (or somebodys) who is (or are) at arm's length from the project needs to try it out and provide some input on what would make it more usable. I don't think that it's probably technically hard to implement any good design we might agree on - but figuring out what a good design would look like may not be so easy.

By the way, I'm pretty sure that you being stupid is not the problem here. At the same time, I did make an effort to document the things that seemed to me to be the key points. I think I've missed some things and that should be fixed, but that doesn't mean that I didn't make an effort. It just means that code is always easier to understand when you wrote it yourself, which for you is true of the old code in this file and false of the new code. I think that the preexisting code here is pretty hard to understand as things stand - it took me over a year of study to figure out exactly what to change and how.
The README is basically just a cut-and-paste of an email you posted to pgsql-hackers, and there is a heck of a lot of complexity there that's not really explained anywhere; you just have to reverse-engineer it from what the code does. Moreover, this code has been around for ten years with zero test coverage. If we get some buildfarm members set up with force_parallel_mode=regress and max_parallel_degree>0, the new code will have more test coverage as of that instant than the existing code has ever had. It's wholly unclear how some of the code in deadlock.c has been tested by anyone, anywhere, ever. I wouldn't be surprised if the savedList = false case in TestConfiguration() has been hit zero times outside of somebody manually modifying the source tree to hit it. If that's ever happened, it was probably only on your machine during the development of this code.

My point is that I'm completely willing to admit that I have not done everything perfectly here, but I also don't want to overstate the level of perfection coming in. This code is battle-tested and it will be really bad if I've broken that and we don't find it before release, but the existing state of affairs did not make avoiding breakage as straightforward as it might have been.

I hope this helps and please let me know what more you think I should do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 14 February 2016 at 08:05, Robert Haas <robertmhaas@gmail.com> wrote:
> First, the overall concept here is that processes can either be a
> member of a lock group or a member of no lock group. The concept of a
> lock group is formally separate from the concept of a parallel group
> created by a ParallelContext, but it is not clear that there will ever
> be any other context in which a lock group will be a good idea. It is
> not impossible to imagine: for example, suppose you had a group of
> backends that all had TCP client connections, and those processes all
> wanted to ingest data and stuff it in a table but without allowing any
> unrelated process to touch the table, say because it was going to be
> inconsistent during the operation and until indexes were afterwards
> rebuilt.
The case that comes to mind for me is in logical decoding, for decoding prepared xacts. Being able to make the prepared xact a member of a "lock group" along with the decoding session's xact may provide a solution to the locking-related challenges there.
I haven't looked closely at what's involved in the decoding prepared xact locking issues yet, just an idea.
To do this it'd have to be possible to add an existing session/xact to a lock group (or make it the leader of a new lock group then join that group). Do you think that's practical with your design?
> I don't have any plans to implement anything like that but I
> felt it was better to keep the concept of a lock group - which is a
> group of processes that cooperate so closely that their locks need not
> conflict - separate from the concept of a parallel context - which is
> a leader process that is most likely connected to a user plus a bunch
> of ephemeral background workers that aren't. That way, if somebody
> later wants to try to reuse the lock grouping stuff for something
> else, nothing will get in the way of that; if not, no harm done, but
> keeping the two things decoupled is at least easier to understand, IMHO.
Yeah, strong +1
Re: [COMMITTERS] pgsql: Introduce group locking to prevent parallel processes from deadlocking
From: Stephen Frost
* Craig Ringer (craig@2ndquadrant.com) wrote:
> On 14 February 2016 at 08:05, Robert Haas <robertmhaas@gmail.com> wrote:
>> First, the overall concept here is that processes can either be a
>> member of a lock group or a member of no lock group. The concept of a
>> lock group is formally separate from the concept of a parallel group
>> created by a ParallelContext, but it is not clear that there will ever
>> be any other context in which a lock group will be a good idea. It is
>> not impossible to imagine: for example, suppose you had a group of
>> backends that all had TCP client connections, and those processes all
>> wanted to ingest data and stuff it in a table but without allowing any
>> unrelated process to touch the table, say because it was going to be
>> inconsistent during the operation and until indexes were afterwards
>> rebuilt.
>
> The case that comes to mind for me is in logical decoding, for decoding
> prepared xacts. Being able to make the prepared xact a member of a "lock
> group" along with the decoding session's xact may provide a solution to
> the locking-related challenges there.

I was thinking this would be the way to address the current issues with parallel pg_dump and having a shared snapshot.

> I haven't looked closely at what's involved in the decoding prepared xact
> locking issues yet, just an idea.

Yeah, don't have much more than it being an idea to use this for the pg_dump case. I do know that's a case which has been brought up a couple of times before.

> To do this it'd have to be possible to add an existing session/xact to a
> lock group (or make it the leader of a new lock group then join that
> group). Do you think that's practical with your design?

Seems like the same question applies to the pg_dump case.

Thanks!

Stephen
On Sat, Feb 13, 2016 at 9:20 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> The case that comes to mind for me is in logical decoding, for decoding
> prepared xacts. Being able to make the prepared xact a member of a "lock
> group" along with the decoding session's xact may provide a solution to
> the locking-related challenges there.
>
> I haven't looked closely at what's involved in the decoding prepared xact
> locking issues yet, just an idea.
>
> To do this it'd have to be possible to add an existing session/xact to a
> lock group (or make it the leader of a new lock group then join that
> group). Do you think that's practical with your design?

I doubt it. It's only safe to join a locking group if you don't yet hold any heavyweight locks. I'm not going to say it'd be impossible to lift that restriction, but it'd be pretty complex, because doing so could either create or remove deadlocks that didn't exist before. For example, suppose A wanting AccessExclusiveLock waits for B wanting AccessExclusiveLock waits for C holding AccessShareLock. Then, C joins A's lock group. If A's lock request can't be granted immediately - say D also holds AccessShareLock on the object - this is a deadlock. Moreover, C can't detect the deadlock in the normal course of things because C is not waiting. Sorting this out does not sound simple.

It could possibly work if the decoding transaction holds no locks at all, joins the prepared xact's locking group, does stuff, and then, when it again reaches a point where it holds no locks, leaves the lock group. I wonder, though, what happens if you deadlock. The decoding transaction gets killed, but you can't kill the prepared transaction, so any locks it held would be retained. Maybe that's OK, but I have a sneaking suspicion there might be situations where we kill the decoding transaction without resolving the deadlock. Sharing locks with a prepared transaction is not really what this was designed for.
I don't really understand what problem you are trying to solve here, but I suspect there is a better solution than group locking. The thing is, in the normal course of events, heavyweight locking prevents a lot of bad stuff from happening. When you become a member of a lock group, you're on your own recognizance to prevent that bad stuff. The parallel code does that (or hopefully does that, anyway) by imposing severe restrictions on what you can do while in parallel mode; those restrictions include "no writes whatsoever" and "no DDL". If you wanted to allow either of those things, you would need to think very, very carefully about that, and then if you decided that it was going to be safe, you'd need to think carefully about it a second time.

As I mentioned to Simon on another thread a while back, Thomas Munro is working on a hash table that uses dynamic shared memory, and as part of that work, he is taking the allocator work that I did a year or two ago and turning that into a full-fledged allocator for dynamic shared memory. Once we have a real allocator and a real hash table for DSM, I believe we'll be able to solve some of the problems that currently require that parallel query - and probably anything that uses group locking - be strictly read-only. For example, we can use that infrastructure to store a combo CID hash table that can grow arbitrarily in a data structure that all cooperating processes can share. Unfortunately, it does not look like that work will be ready in time to be included in PostgreSQL 9.6. I think he will be in a position to submit it in time for PostgreSQL 9.7, though.
>> I don't have any plans to implement anything like that but I
>> felt it was better to keep the concept of a lock group - which is a
>> group of processes that cooperate so closely that their locks need not
>> conflict - separate from the concept of a parallel context - which is
>> a leader process that is most likely connected to a user plus a bunch
>> of ephemeral background workers that aren't. That way, if somebody
>> later wants to try to reuse the lock grouping stuff for something
>> else, nothing will get in the way of that; if not, no harm done, but
>> keeping the two things decoupled is at least easier to understand, IMHO.
>
> Yeah, strong +1

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> First, the overall concept here is that processes can either be a
> member of a lock group or a member of no lock group.

Check.

> Second, I can review the data structure changes and the associated
> invariants.

The data structure additions seemed relatively straightforward, though I did wonder why you added groupLeader to PROCLOCK. Why is that not redundant with tag.myProc->lockGroupLeader? If it isn't redundant, it's inadequately documented. If it is, seems better to lose it.

Also, isn't the comment on this PGPROC field incorrect:

    PGPROC *lockGroupLeader; /* lock group leader, if I'm a follower */

ie shouldn't you s/follower/member/ ?

> ... the lock manager lock that protects the fields in a
> given PGPROC is chosen by taking pgprocno modulo the number of lock
> manager partitions.

pgprocno of the specific PGPROC, or of the group leader? If it's the former, I'm pretty suspicious that there are deadlock and/or linked-list-corruption hazards here. If it's the latter, at least the comments around this are misleading.

> Each PROCLOCK now has a new groupLeader flag. While a PGPROC's
> lockGroupLeader may be NULL if that process is not involved in group
> locking, a PROCLOCK's groupLeader always has a non-NULL value; it
> points back to the PGPROC that acquired the lock.

This is not what the comment on it says; and your prose explanation here implies that it should actually be equal to tag.myProc, or else you are using some definition of "acquired the lock" that I don't follow at all. There could be lots of PGPROCs that have acquired a lock in one sense or another; which one do you mean?

> With respect to pg_locks - and for that matter also pg_stat_activity -
> I think you are right that improvement is needed.

This is really the core of my concern at the moment.
I think that isolationtester is probably outright broken for any situation where the queries-under-test are being parallel executed, and so will be any other client that's trying to identify who blocks whom from pg_locks.

> The simplest thing we could do to make that easier is, in
> pg_stat_activity, have parallel workers advertise the PID that
> launched them in a new field; and in pg_locks, have members of a lock
> group advertise the leader's PID in a new field.

That would be simple for us, but it would break every existing client-side query that tries to identify blockers/blockees; and not only would those queries need work but they would become very substantially more complex and slower (probably at least 4-way joins not 2-way). We already know that isolationtester's query has performance problems in the buildfarm. I think more thought is needed here, and that this area should be considered a release blocker. It's certainly a blocker for any thought of turning parallel query on by default.

> I hope this helps and please let me know what more you think I should do.

At the least you need to make another pass over the comments and README addition. I highlighted above a few critical inaccuracies, and it's not hard to find a lot of less-critical typos in the comments in that commit. Transposing some of what you've written here into the README would likely be a good thing too.

regards, tom lane
On Sun, Feb 14, 2016 at 1:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> First, the overall concept here is that processes can either be a
>> member of a lock group or a member of no lock group.
>
> Check.
>
>> Second, I can review the data structure changes and the associated
>> invariants.
>
> The data structure additions seemed relatively straightforward, though
> I did wonder why you added groupLeader to PROCLOCK. Why is that not
> redundant with tag.myProc->lockGroupLeader? If it isn't redundant,
> it's inadequately documented. If it is, seems better to lose it.

That's a good question. There are locking considerations: the lock manager lock for a given PROCLOCK isn't necessarily the same one that protects tag.myProc->lockGroupLeader. I can go and look back through the contexts where we use the PROCLOCK field and see if we can get by without this, but I suspect it's too much of a pain.

> Also, isn't the comment on this PGPROC field incorrect:
>
>     PGPROC *lockGroupLeader; /* lock group leader, if I'm a follower */
>
> ie shouldn't you s/follower/member/ ?

Yes, that would be better.

>> ... the lock manager lock that protects the fields in a
>> given PGPROC is chosen by taking pgprocno modulo the number of lock
>> manager partitions.
>
> pgprocno of the specific PGPROC, or of the group leader? If it's
> the former, I'm pretty suspicious that there are deadlock and/or
> linked-list-corruption hazards here. If it's the latter, at least
> the comments around this are misleading.

Leader. The other way would be nuts.

>> Each PROCLOCK now has a new groupLeader flag. While a PGPROC's
>> lockGroupLeader may be NULL if that process is not involved in group
>> locking, a PROCLOCK's groupLeader always has a non-NULL value; it
>> points back to the PGPROC that acquired the lock.
>
> This is not what the comment on it says; and your prose explanation
> here implies that it should actually be equal to tag.myProc, or else
> you are using some definition of "acquired the lock" that I don't
> follow at all. There could be lots of PGPROCs that have acquired
> a lock in one sense or another; which one do you mean?

The PROCLOCK's groupLeader will be equal to tag.myProc->lockGroupLeader unless that is NULL. In that case it will be equal to tag.myProc. Or at least that's the intention, modulo bugs.

>> With respect to pg_locks - and for that matter also pg_stat_activity -
>> I think you are right that improvement is needed.
>
> This is really the core of my concern at the moment. I think that
> isolationtester is probably outright broken for any situation where the
> queries-under-test are being parallel executed, and so will be any other
> client that's trying to identify who blocks whom from pg_locks.

OK.

>> The simplest thing we could do to make that easier is, in
>> pg_stat_activity, have parallel workers advertise the PID that
>> launched them in a new field; and in pg_locks, have members of a lock
>> group advertise the leader's PID in a new field.
>
> That would be simple for us, but it would break every existing client-side
> query that tries to identify blockers/blockees; and not only would those
> queries need work but they would become very substantially more complex
> and slower (probably at least 4-way joins not 2-way). We already know
> that isolationtester's query has performance problems in the buildfarm.
> I think more thought is needed here, and that this area should be
> considered a release blocker. It's certainly a blocker for any thought
> of turning parallel query on by default.

I don't quite see why this would cause the joins to get more complex. It's also not really clear what the alternative is.
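The groupLeader rule Robert states above is compact enough to express directly. A sketch with illustrative stand-in types (not the real PostgreSQL declarations):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for the real PGPROC declaration. */
typedef struct PGPROC
{
    struct PGPROC *lockGroupLeader; /* NULL if not in a lock group */
} PGPROC;

/*
 * The stated invariant: a PROCLOCK's groupLeader is the acquiring
 * process's lockGroupLeader when group locking is in use, and the
 * acquiring process itself otherwise -- so it is always non-NULL.
 */
static PGPROC *
proclock_group_leader(PGPROC *myProc)
{
    return myProc->lockGroupLeader != NULL ? myProc->lockGroupLeader : myProc;
}
```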
You could show all the locks under the leader's PID, but that's horribly confusing because there might be duplicates, and because if somebody's waiting you really want to be able to tell which process to target with gdb. You can show everybody under their own PIDs as now, but then you can't tell which processes are in groups together. So I don't see what there is to do except show both values. Maybe there's some whole-new way of displaying this information that would be more clear, but I don't really see what it is.

>> I hope this helps and please let me know what more you think I should do.
>
> At the least you need to make another pass over the comments and README
> addition. I highlighted above a few critical inaccuracies, and it's not
> hard to find a lot of less-critical typos in the comments in that commit.
> Transposing some of what you've written here into the README would likely
> be a good thing too.

I'm getting on a long plane flight shortly but will go back over this when I can find some time to do that in an appropriately careful way. Please feel free to make such corrections to comments and so forth as may seem urgent to you in the meanwhile.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Feb 14, 2016 at 1:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> ... the lock manager lock that protects the fields in a
>>> given PGPROC is chosen by taking pgprocno modulo the number of lock
>>> manager partitions.
>
>> pgprocno of the specific PGPROC, or of the group leader? If it's
>> the former, I'm pretty suspicious that there are deadlock and/or
>> linked-list-corruption hazards here. If it's the latter, at least
>> the comments around this are misleading.
>
> Leader. The other way would be nuts.

On closer inspection, that's actually still somewhat problematic, because that essentially means that no one can inspect another backend's lockGroupLeader etc fields unless they take *all* the lock partition LWLocks first. You can't take just the relevant one because you don't know which one that is, at least not without dereferencing lockGroupLeader, which is entirely unsafe if you haven't got the appropriate lock.

This isn't a problem for members of the lock group, since they already know who the leader is and hence which partition lock to use. And it's not a problem for the deadlock detector either, since it'll take all those partition locks anyway. But it makes life difficult for status inspection code. I'm finding that the only way to write a lock-group-aware function that tells whether A is blocked by B is to hold both the ProcArrayLock and *all* of the lock partition locks throughout inspection of the data structures. I'd hoped to be able to restrict it to locking just the partition holding the lock A is blocked on, but that ain't working. Maybe this is all okay, but it makes me nervous that the data structures may have been over-optimized.

regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Feb 14, 2016 at 1:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> pgprocno of the specific PGPROC, or of the group leader? If it's
>> the former, I'm pretty suspicious that there are deadlock and/or
>> linked-list-corruption hazards here. If it's the latter, at least
>> the comments around this are misleading.
>
> Leader. The other way would be nuts.

... and btw, either BecomeLockGroupMember() has got this backwards, or I'm still fundamentally confused.

Also, after fixing that it would be good to add a comment explaining why it's not fundamentally unsafe for BecomeLockGroupMember() to examine leader->pgprocno without having any relevant lock. AFAICS, that's only okay because the pgprocno fields are never changed after InitProcGlobal, even when a PGPROC is recycled.

regards, tom lane
On Sun, Feb 14, 2016 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Feb 14, 2016 at 1:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> pgprocno of the specific PGPROC, or of the group leader? If it's
>>> the former, I'm pretty suspicious that there are deadlock and/or
>>> linked-list-corruption hazards here. If it's the latter, at least
>>> the comments around this are misleading.
>
>> Leader. The other way would be nuts.
>
> ... and btw, either BecomeLockGroupMember() has got this backwards, or
> I'm still fundamentally confused.

Yeah, you're right. Attached is a draft patch that tries to clean up that and a bunch of other things that you raised. It hasn't really been tested yet, so it might be buggy; I wrote it during my long plane flight. Fixes include:

1. It removes the groupLeader flag from PROCLOCK. You're right: we don't need it.

2. It fixes this problem with BecomeLockGroupMember(). You're right: the old way was backwards.

3. It fixes TopoSort() to be less inefficient by checking whether the EDGE is for the correct lock before doing anything else. I realized this while updating comments related to EDGE.

4. It adds some text to the lmgr README.

5. It reflows the existing text in the lmgr README to fit within 80 columns.

6. It adjusts some other comments.

Possibly this should be broken up into multiple patches, but I'm just sending it along all together for now.

> Also, after fixing that it would be good to add a comment explaining why
> it's not fundamentally unsafe for BecomeLockGroupMember() to examine
> leader->pgprocno without having any relevant lock. AFAICS, that's only
> okay because the pgprocno fields are never changed after InitProcGlobal,
> even when a PGPROC is recycled.

I am sort of disinclined to think that this needs a comment. Do we really have a comment every other place that pgprocno is referenced without a lock?
Or maybe there are none, but it seems like overkill to me to comment on that at every site of use. Better to comment on the pgprocno definition itself and say "never changes".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Feb 14, 2016 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Also, after fixing that it would be good to add a comment explaining why >> it's not fundamentally unsafe for BecomeLockGroupMember() to examine >> leader->pgprocno without having any relevant lock. AFAICS, that's only >> okay because the pgprocno fields are never changed after InitProcGlobal, >> even when a PGPROC is recycled. > I am sort of disinclined to think that this needs a comment. I might not either, except that the entire point of that piece of code is to protect against the possibility that the leader PGPROC vanishes before we can get this lock. Since you've got extensive comments addressing that point, it seems appropriate to explain why this one line doesn't invalidate the whole design ... because it's right on the hairy edge of doing so. If we did something like memset a PGPROC to all zeroes when we free it, which in general would seem like a perfectly reasonable thing to do, this code would be broken (and not in an easy to detect way; it would indeed be indistinguishable from the way in which it's broken right now). > Do we > really have a comment every other place that pgprocno is referenced > without a lock? I suspect strongly that there is no other place that attempts to touch any part of an invalid (freed) PGPROC. If there is, more than likely it's unsafe. I don't have time right now to read the patch in detail, but will look tomorrow. In the meantime, thanks for addressing my concerns. regards, tom lane
On Mon, Feb 15, 2016 at 7:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Sun, Feb 14, 2016 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Also, after fixing that it would be good to add a comment explaining why >>> it's not fundamentally unsafe for BecomeLockGroupMember() to examine >>> leader->pgprocno without having any relevant lock. AFAICS, that's only >>> okay because the pgprocno fields are never changed after InitProcGlobal, >>> even when a PGPROC is recycled. > >> I am sort of disinclined to think that this needs a comment. > > I might not either, except that the entire point of that piece of code is > to protect against the possibility that the leader PGPROC vanishes before > we can get this lock. Since you've got extensive comments addressing that > point, it seems appropriate to explain why this one line doesn't invalidate > the whole design ... because it's right on the hairy edge of doing so. > If we did something like memset a PGPROC to all zeroes when we free it, > which in general would seem like a perfectly reasonable thing to do, this > code would be broken (and not in an easy to detect way; it would indeed > be indistinguishable from the way in which it's broken right now). OK. Well, I'm happy to have you add a comment in a patch of your own, or suggest something to include in mine. I'm less sure I can write something independently that you'll like. >> Do we >> really have a comment every other place that pgprocno is referenced >> without a lock? > > I suspect strongly that there is no other place that attempts to touch any > part of an invalid (freed) PGPROC. If there is, more than likely it's > unsafe. > > I don't have time right now to read the patch in detail, but will look > tomorrow. In the meantime, thanks for addressing my concerns. Sure thing. I appreciate the review. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Yeah, you're right. Attached is a draft patch that tries to clean up
> that and a bunch of other things that you raised.

I've been reviewing this patch, and I had a question: why do we need to bother with a lockGroupLeaderIdentifier field? It seems totally redundant with the leader's pid field, ie, why doesn't BecomeLockGroupMember simply compare leader->pid with the PID it's passed? For some more safety, it could also verify that leader->lockGroupLeader == leader; but I don't see what the extra field is buying us.

regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > Yeah, you're right. Attached is a draft patch that tries to clean up > that and a bunch of other things that you raised. I reviewed this patch. I don't have any particular comment on the changes in deadlock.c; I haven't studied the code closely enough to know if those are right. I did think that the commentary elsewhere could be improved some more, so attached is a delta patch on top of yours that shows what I suggest. I've not done anything here about removing lockGroupLeaderIdentifier, although I will be happy to send a patch for that unless you know of some reason I missed why it's necessary. regards, tom lane diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README index 8eaa91c..5a62c8f 100644 *** a/src/backend/storage/lmgr/README --- b/src/backend/storage/lmgr/README *************** acquire an AccessShareLock while the oth *** 603,609 **** This might seem dangerous and could be in some cases (more on that below), but if we didn't do this then parallel query would be extremely prone to self-deadlock. For example, a parallel query against a relation on which the ! leader had already AccessExclusiveLock would hang, because the workers would try to lock the same relation and be blocked by the leader; yet the leader can't finish until it receives completion indications from all workers. An undetected deadlock results. This is far from the only scenario where such a --- 603,609 ---- This might seem dangerous and could be in some cases (more on that below), but if we didn't do this then parallel query would be extremely prone to self-deadlock. For example, a parallel query against a relation on which the ! leader already had AccessExclusiveLock would hang, because the workers would try to lock the same relation and be blocked by the leader; yet the leader can't finish until it receives completion indications from all workers. An undetected deadlock results. 
This is far from the only scenario where such a *************** quickly enough for this interlock to fai *** 664,690 **** A PGPROC's lockGroupLeader is NULL for processes not involved in parallel query. When a process wants to cooperate with parallel workers, it becomes a ! lock group leader, which means setting this field back to point to its own PGPROC. When a parallel worker starts up, it points this field at the leader, with the above-mentioned interlock. The lockGroupMembers field is only used in ! the leader; it is a list of the workers. The lockGroupLink field is used to ! link the leader and all workers into the leader's list. All of these fields ! are protected by the lock manager locks; the lock manager lock that protects ! the lockGroupLeaderIdentifier, lockGroupLeader, and lockGroupMembers fields in ! a given PGPROC is chosen by taking pgprocno modulo the number of lock manager ! partitions. This unusual arrangement has a major advantage: the deadlock ! detector can count on the fact that no lockGroupLeader field can change while ! the deadlock detector is running, because it knows that it holds all the lock ! manager locks. A PGPROC's lockGroupLink is protected by the lock manager ! partition lock for the group of which it is a part. If it's not part of any ! group, this field is unused and can only be examined or modified by the ! process that owns the PGPROC. We institute a further coding rule that a process cannot join or leave a lock group while owning any PROCLOCK. Therefore, given a lock manager lock ! sufficient to examine PROCLOCK *proclock, it also safe to examine ! proclock->tag.myProc->lockGroupLeader (but not the other fields mentioned in ! the previous paragraph). User Locks (Advisory Locks) --------------------------- --- 664,689 ---- A PGPROC's lockGroupLeader is NULL for processes not involved in parallel query. When a process wants to cooperate with parallel workers, it becomes a ! 
lock group leader, which means setting this field to point to its own PGPROC. When a parallel worker starts up, it points this field at the leader, with the above-mentioned interlock. The lockGroupMembers field is only used in ! the leader; it is a list of the member PGPROCs of the lock group (the leader ! and all workers). The lockGroupLink field is the list link for this list. ! ! All four of these fields are considered to be protected by a lock manager ! partition lock. The partition lock that protects these fields within a given ! lock group is chosen by taking the leader's pgprocno modulo the number of lock ! manager partitions. This unusual arrangement has a major advantage: the ! deadlock detector can count on the fact that no lockGroupLeader field can ! change while the deadlock detector is running, because it knows that it holds ! all the lock manager locks. Also, holding this single lock allows safe ! manipulation of the lockGroupMembers list for the lock group. We institute a further coding rule that a process cannot join or leave a lock group while owning any PROCLOCK. Therefore, given a lock manager lock ! sufficient to examine a PROCLOCK *proclock, it is also safe to examine ! proclock->tag.myProc->lockGroupLeader (but not the other lock-group-related ! fields of the PGPROC). User Locks (Advisory Locks) --------------------------- diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c index d81746d..21f92dd 100644 *** a/src/backend/storage/lmgr/deadlock.c --- b/src/backend/storage/lmgr/deadlock.c *************** *** 40,47 **** * is, it will be the leader rather than any other member of the lock group. * The group leaders act as representatives of the whole group even though * those particular processes need not be waiting at all. There will be at ! * least one member of the group on the wait queue for the given lock, maybe ! * more. 
*/ typedef struct { --- 40,47 ---- * is, it will be the leader rather than any other member of the lock group. * The group leaders act as representatives of the whole group even though * those particular processes need not be waiting at all. There will be at ! * least one member of the waiter's lock group on the wait queue for the given ! * lock, maybe more. */ typedef struct { diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c index cff8c08..deb88c7 100644 *** a/src/backend/storage/lmgr/proc.c --- b/src/backend/storage/lmgr/proc.c *************** ProcKill(int code, Datum arg) *** 826,835 **** leader->lockGroupLeader = NULL; if (leader != MyProc) { - procgloballist = leader->procgloballist; - /* Leader exited first; return its PGPROC. */ SpinLockAcquire(ProcStructLock); leader->links.next = (SHM_QUEUE *) *procgloballist; *procgloballist = leader; SpinLockRelease(ProcStructLock); --- 826,834 ---- leader->lockGroupLeader = NULL; if (leader != MyProc) { /* Leader exited first; return its PGPROC. */ SpinLockAcquire(ProcStructLock); + procgloballist = leader->procgloballist; leader->links.next = (SHM_QUEUE *) *procgloballist; *procgloballist = leader; SpinLockRelease(ProcStructLock); *************** ProcSleep(LOCALLOCK *locallock, LockMeth *** 1004,1010 **** int i; /* ! * If group locking is in use, locks held my members of my locking group * need to be included in myHeldLocks. */ if (leader != NULL) --- 1003,1009 ---- int i; /* ! * If group locking is in use, locks held by members of my locking group * need to be included in myHeldLocks. */ if (leader != NULL) *************** BecomeLockGroupMember(PGPROC *leader, in *** 1802,1810 **** /* PID must be valid. */ Assert(pid != 0); ! /* Try to join the group. */ leader_lwlock = LockHashPartitionLockByProc(leader); LWLockAcquire(leader_lwlock, LW_EXCLUSIVE); if (leader->lockGroupLeaderIdentifier == pid) { ok = true; --- 1801,1817 ---- /* PID must be valid. */ Assert(pid != 0); ! /* ! 
* Get lock protecting the group fields. Note LockHashPartitionLockByProc ! * accesses leader->pgprocno in a PGPROC that might be free. This is safe ! * because all PGPROCs' pgprocno fields are set during shared memory ! * initialization and never change thereafter; so we will acquire the ! * correct lock even if the leader PGPROC is in process of being recycled. ! */ leader_lwlock = LockHashPartitionLockByProc(leader); LWLockAcquire(leader_lwlock, LW_EXCLUSIVE); + + /* Try to join the group */ if (leader->lockGroupLeaderIdentifier == pid) { ok = true; diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h index a3e19e1..2053bfe 100644 *** a/src/include/storage/lock.h --- b/src/include/storage/lock.h *************** typedef enum *** 477,486 **** * by one of the lock hash partition locks. Since the deadlock detector * acquires all such locks anyway, this makes it safe for it to access these * fields without doing anything extra. To avoid contention as much as ! * possible, we map different PGPROCs to different partition locks. */ ! #define LockHashPartitionLockByProc(p) \ ! LockHashPartitionLock((p)->pgprocno) /* * function prototypes --- 477,487 ---- * by one of the lock hash partition locks. Since the deadlock detector * acquires all such locks anyway, this makes it safe for it to access these * fields without doing anything extra. To avoid contention as much as ! * possible, we map different PGPROCs to different partition locks. The lock ! * used for a given lock group is determined by the group leader's pgprocno. */ ! #define LockHashPartitionLockByProc(leader_pgproc) \ ! LockHashPartitionLock((leader_pgproc)->pgprocno) /* * function prototypes diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h index 81bddbb..a9405ce 100644 *** a/src/include/storage/proc.h --- b/src/include/storage/proc.h *************** struct PGPROC *** 152,158 **** */ TransactionId procArrayGroupMemberXid; ! /* Per-backend LWLock. Protects fields below. 
*/ LWLock backendLock; /* Lock manager data, recording fast-path locks taken by this backend. */ --- 152,158 ---- */ TransactionId procArrayGroupMemberXid; ! /* Per-backend LWLock. Protects fields below (but not group fields). */ LWLock backendLock; /* Lock manager data, recording fast-path locks taken by this backend. */ *************** struct PGPROC *** 163,173 **** * lock */ /* ! * Support for lock groups. Use LockHashPartitionLockByProc to get the ! * LWLock protecting these fields. */ int lockGroupLeaderIdentifier; /* MyProcPid, if I'm a leader */ ! PGPROC *lockGroupLeader; /* lock group leader, if I'm a follower */ dlist_head lockGroupMembers; /* list of members, if I'm a leader */ dlist_node lockGroupLink; /* my member link, if I'm a member */ }; --- 163,173 ---- * lock */ /* ! * Support for lock groups. Use LockHashPartitionLockByProc on the group ! * leader to get the LWLock protecting these fields. */ int lockGroupLeaderIdentifier; /* MyProcPid, if I'm a leader */ ! PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */ dlist_head lockGroupMembers; /* list of members, if I'm a leader */ dlist_node lockGroupLink; /* my member link, if I'm a member */ };
Robert Haas <robertmhaas@gmail.com> writes: > 1. It removes the groupLeader flag from PGPROC. You're right: we don't need it. It occurs to me that this claim is bogus: > We institute a further coding rule that a process cannot join or leave a lock > group while owning any PROCLOCK. Therefore, given a lock manager lock > sufficient to examine PROCLOCK *proclock, it also safe to examine > proclock->tag.myProc->lockGroupLeader (but not the other fields mentioned in > the previous paragraph). Yes, we can legislate that workers have to join before taking any lock, and if processes only leave the group at death then that end of it is not a problem either; but it is patently not the case that a process will hold no locks when it calls BecomeLockGroupLeader(). We may be able to salvage this simplification anyway, but it will require a tighter specification of when it's safe to look at proclock->tag.myProc->lockGroupLeader. Another possibility is to forget about executor-time BecomeLockGroupLeader(), and just say that any process that isn't a worker always becomes its own group leader at startup. Then the above para is true as written. I don't know what downsides this might have; do your changes in the lock manager try to optimize the null-lockGroupLeader case? regards, tom lane
On 14 February 2016 at 08:05, Robert Haas <robertmhaas@gmail.com> wrote:
> The concept of a
> lock group is formally separate from the concept of a parallel group
> created by a ParallelContext, but it is not clear that there will ever
> be any other context in which a lock group will be a good idea.
Just coming back to this in terms of what Stephen and I raised: Robert, do you think this design as it stands can handle cases where a normal standalone backend gets promoted to a lock-group leader that others can then join?
On Tue, Feb 16, 2016 at 8:37 PM, Craig Ringer <craig@2ndquadrant.com> wrote: > On 14 February 2016 at 08:05, Robert Haas <robertmhaas@gmail.com> wrote: >> The concept of a >> lock group is formally separate from the concept of a parallel group >> created by a ParallelContext, but it is not clear that there will ever >> be any other context in which a lock group will be a good idea. > > Just coming back to this in terms of what Stephen and I raised: Robert, do > you think this design as it stands can handle cases where a normal > standalone backend gets promoted to a lock-group leader that others can then > join? That's the exact intended purpose of it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Feb 16, 2016 at 3:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> 1. It removes the groupLeader flag from PGPROC. You're right: we don't need it. > > It occurs to me that this claim is bogus: > >> We institute a further coding rule that a process cannot join or leave a lock >> group while owning any PROCLOCK. Therefore, given a lock manager lock >> sufficient to examine PROCLOCK *proclock, it also safe to examine >> proclock->tag.myProc->lockGroupLeader (but not the other fields mentioned in >> the previous paragraph). > > Yes, we can legislate that workers have to join before taking any lock, > and if processes only leave the group at death then that end of it is not > a problem either; but it is patently not the case that a process will hold > no locks when it calls BecomeLockGroupLeader(). > > We may be able to salvage this simplification anyway, but it will require > a tighter specification of when it's safe to look at > proclock->tag.myProc->lockGroupLeader. > > Another possibility is to forget about executor-time > BecomeLockGroupLeader(), and just say that any process that isn't a worker > always becomes its own group leader at startup. Then the above para is > true as written. I don't know what downsides this might have; do your > changes in the lock manager try to optimize the null-lockGroupLeader case? Hmm, that's true. I don't think it actually matters all that much, because proclock->tag.myProc->lockGroupLeader == NULL has pretty much the same effect as if proclock->tag.myProc->lockGroupLeader == proclock->tag.myProc. But not completely. One problem is that we don't currently assume that 8-byte writes are atomic, so somebody might see the group leader field half-set, which would be bad. But, yes, there are some optimizations for that case. For example, in LockCheckConflicts: /* If no group locking, it's definitely a conflict. 
*/
    if (leader == NULL)
    {
        PROCLOCK_PRINT("LockCheckConflicts:conflicting (simple)",
                       proclock);
        return STATUS_FOUND;
    }

ProcSleep also has a check of this sort. In theory those optimizations are quite important, because they can avoid extra passes over the lock queue which theoretically turn O(N) algorithms into O(N^2) or O(N^2) into O(N^3) or whatever. But in practice I'm not sure that the lock queues are ever long enough for that to become an issue. And if they are, the fact that your system is lock-bound is probably a much bigger problem than a few extra CPU cycles inside the lock manager to figure that out. But I don't want to be too flip about that analysis - there might be some scenario where the extra cycles do hurt.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Hmm, that's true. I don't think it actually matters all that much,
> because proclock->tag.myProc->lockGroupLeader == NULL has pretty much
> the same effect as if proclock->tag.myProc->lockGroupLeader ==
> proclock->tag.myProc. But not completely. One problem is that we
> don't currently assume that 8-byte writes are atomic, so somebody
> might see the group leader field half-set, which would be bad.

Yes, exactly. I'm not certain if there are any real platforms where a pointer-sized write wouldn't be atomic (it sure sounds inefficient for that to be true), but we have not assumed that to date and I'd just as soon not start here.

regards, tom lane
On Wed, Feb 17, 2016 at 10:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Hmm, that's true. I don't think it actually matters all that much, >> because proclock->tag.myProc->lockGroupLeader == NULL has pretty much >> the same effect as if proclock->tag.myProc->lockGroupLeader == >> proclock->tag.myProc. But not completely. One problem is that we >> don't currently assume that 8-byte writes are atomic, so somebody >> might see the group leader field half-set, which would be bad. > > Yes, exactly. I'm not certain if there are any real platforms where > a pointer-sized write wouldn't be atomic (it sure sounds inefficient > for that to be true), but we have not assumed that to date and I'd > just as soon not start here. I guess it's worth pointing out that storing the groupLeader in the PROCLOCK, as we do currently, avoids this issue entirely. The deadlock detector can safely follow lockGroupLeader pointers because it holds all the lock manager partition locks, and proc.c/lock.c don't need to do so if groupLeader is present in the PROCLOCK. Nor does that field ever need to be updated, because it's defined to always be non-NULL: if a process becomes a group leader, the value stored in the PROCLOCKs does not change. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: [COMMITTERS] pgsql: Introduce group locking to prevent parallel processes from deadl
From: Andres Freund
On 2016-02-16 23:33:47 -0500, Tom Lane wrote:
> Yes, exactly. I'm not certain if there are any real platforms where
> a pointer-sized write wouldn't be atomic (it sure sounds inefficient
> for that to be true), but we have not assumed that to date and I'd
> just as soon not start here.

FWIW, there's no sizeof(void*) == 8 platform supported by PG where aligned 8 byte writes aren't atomic. There are a number of supported platforms with sizeof(void*) == 4 without atomic 8 byte writes though.
Robert Haas <robertmhaas@gmail.com> writes:
> I guess it's worth pointing out that storing the groupLeader in the
> PROCLOCK, as we do currently, avoids this issue entirely.

Right, that's why I framed it as "can we salvage this simplification" (referring to removal of that field). It may be best to just fix the existing bug and docs deficiencies before returning to that idea.

regards, tom lane
On Mon, Feb 15, 2016 at 12:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
>
> > With respect to pg_locks - and for that matter also pg_stat_activity -
> > I think you are right that improvement is needed.
>
> This is really the core of my concern at the moment. I think that
> isolationtester is probably outright broken for any situation where the
> queries-under-test are being parallel executed, and so will be any other
> client that's trying to identify who blocks whom from pg_locks.
>
> > The simplest thing we could do to make that easier is, in
> > pg_stat_activity, have parallel workers advertise the PID that
> > launched them in a new field; and in pg_locks, have members of a lock
> > group advertise the leader's PID in a new field.
>
For a parallel query using one parallel worker, with another unrelated
backend trying to acquire an Access Exclusive lock on the table the
parallel query is using, the lock monitoring query [1] displays the
information as below:
 blocked_pid | blocking_pid |            blocked_statement            | current_statement_in_blocking_process
-------------+--------------+-----------------------------------------+----------------------------------------
        5052 |         3464 | Lock table t1 in Access Exclusive Mode; |
        5052 |         4128 | Lock table t1 in Access Exclusive Mode; | select count(*) from t1 where c1 < 10;
(2 rows)
Here, backend 5052 is waiting to acquire an Access Exclusive lock on t1,
which is held in Access Share mode by master backend 4128 and by parallel
worker 3464. Now, I think it is tricky for a user to find out what exactly
is going on using the current lock monitoring queries. Do you think that by
adding an additional leader pid column, we could eliminate the duplication
in the above case, and in all such cases, without an additional join or
making the above query more complex?
Could we consider not showing lock information in pg_locks for a parallel
worker in certain cases, namely where the parallel worker holds a lock in
the same mode on the same resource as the leader backend?
Currently, pg_stat_activity displays NULL for both query and state for
parallel workers. I think we should call pgstat_report_activity() in
ParallelWorkerMain() or ParallelQueryMain() to display that information.
One thing which needs some thought is what query we should display for a
parallel worker. Currently, while forming the query descriptor in a
parallel worker, we use "<parallel query>", so one idea is to just display
that; alternatively, we could display the query used by the master backend
(if that is not currently available, we might need to pass it down from
the master backend).
>
> That would be simple for us, but it would break every existing client-side
> query that tries to identify blockers/blockees; and not only would those
> queries need work but they would become very substantially more complex
> and slower (probably at least 4-way joins not 2-way). We already know
> that isolationtester's query has performance problems in the buildfarm.
> I think more thought is needed here,
Agreed. I think before deciding what exactly to add or change in pg_locks
or the lock monitoring queries, it might be helpful to define how we want
to convey the information to the user/DBA when a group of parallel
processes is involved as blockers/blockees. Some of the cases are:
a) a standalone backend waits for a group of parallel processes holding
a lock in the same mode on the same resource.
b) a group of parallel workers waits for a lock held by a standalone
backend or by some other parallel group.
c) a standalone backend waits for a group of parallel processes holding
locks in different modes on the same resource.
and other similar cases.
Before jumping into discussion of a solution, I would like to know:
do you and/or Robert also have the above kinds of cases in mind where you
want a better way to display the information to the user, or something else?
[1] -
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.GRANTED;
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Yeah, you're right.  Attached is a draft patch that tries to clean up
>> that and a bunch of other things that you raised.

> I've been reviewing this patch, and I had a question: why do we need to
> bother with a lockGroupLeaderIdentifier field?  It seems totally redundant
> with the leader's pid field, ie, why doesn't BecomeLockGroupMember simply
> compare leader->pid with the PID it's passed?  For some more safety, it
> could also verify that leader->lockGroupLeader == leader; but I don't
> see what the extra field is buying us.

Here is a proposed patch that gets rid of lockGroupLeaderIdentifier.  I
concluded that the "leader->lockGroupLeader == leader" test is necessary,
because we don't ever reset the pid field of a dead PGPROC until it's
re-used.  However, with that check, this seems isomorphic to the existing
code, because lockGroupLeader is reset to NULL at all the same places that
lockGroupLeaderIdentifier is reset.  So we do not need both of those fields.

			regards, tom lane

diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index c399618..56b0a12 100644
*** a/src/backend/storage/lmgr/README
--- b/src/backend/storage/lmgr/README
*************** those cases so that they no longer use h
*** 650,676 ****
  (which is not a crazy idea, given that such lock acquisitions are not
  expected to deadlock and that heavyweight lock acquisition is fairly
  slow anyway).
  
! Group locking adds four new members to each PGPROC: lockGroupLeaderIdentifier,
! lockGroupLeader, lockGroupMembers, and lockGroupLink.  The first is simply a
! safety mechanism.  A newly started parallel worker has to try to join the
! leader's lock group, but it has no guarantee that the group leader is still
! alive by the time it gets started.  We try to ensure that the parallel leader
! dies after all workers in normal cases, but also that the system could survive
! relatively intact if that somehow fails to happen.  This is one of the
! precautions against such a scenario: the leader relays its PGPROC and also its
! PID to the worker, and the worker fails to join the lock group unless the
! given PGPROC still has the same PID.  We assume that PIDs are not recycled
! quickly enough for this interlock to fail.
! 
! A PGPROC's lockGroupLeader is NULL for processes not involved in parallel
! query.  When a process wants to cooperate with parallel workers, it becomes a
! lock group leader, which means setting this field to point to its own
! PGPROC.  When a parallel worker starts up, it points this field at the leader,
! with the above-mentioned interlock.
  
  The lockGroupMembers field is only used in the leader; it is a list of the
  member PGPROCs of the lock group (the leader and all workers).  The
  lockGroupLink field is the list link for this list.
  
! All four of these fields are considered to be protected by a lock manager
  partition lock.  The partition lock that protects these fields within a given
  lock group is chosen by taking the leader's pgprocno modulo the number of lock
  manager partitions.  This unusual arrangement has a major advantage: the
--- 650,665 ----
  (which is not a crazy idea, given that such lock acquisitions are not
  expected to deadlock and that heavyweight lock acquisition is fairly
  slow anyway).
  
! Group locking adds three new members to each PGPROC: lockGroupLeader,
! lockGroupMembers, and lockGroupLink.  A PGPROC's lockGroupLeader is NULL for
! processes not involved in parallel query.  When a process wants to cooperate
! with parallel workers, it becomes a lock group leader, which means setting
! this field to point to its own PGPROC.  When a parallel worker starts up, it
! points this field at the leader.
  
  The lockGroupMembers field is only used in the leader; it is a list of the
  member PGPROCs of the lock group (the leader and all workers).  The
  lockGroupLink field is the list link for this list.
  
! All three of these fields are considered to be protected by a lock manager
  partition lock.  The partition lock that protects these fields within a given
  lock group is chosen by taking the leader's pgprocno modulo the number of lock
  manager partitions.  This unusual arrangement has a major advantage: the
*************** change while the deadlock detector is ru
*** 679,684 ****
--- 668,685 ----
  all the lock manager locks.  Also, holding this single lock allows safe
  manipulation of the lockGroupMembers list for the lock group.
  
+ We need an additional interlock when setting these fields, because a newly
+ started parallel worker has to try to join the leader's lock group, but it
+ has no guarantee that the group leader is still alive by the time it gets
+ started.  We try to ensure that the parallel leader dies after all workers
+ in normal cases, but also that the system could survive relatively intact
+ if that somehow fails to happen.  This is one of the precautions against
+ such a scenario: the leader relays its PGPROC and also its PID to the
+ worker, and the worker fails to join the lock group unless the given PGPROC
+ still has the same PID and is still a lock group leader.  We assume that
+ PIDs are not recycled quickly enough for this interlock to fail.
+ 
+ 
  User Locks (Advisory Locks)
  ---------------------------

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 114fba0..00787d4 100644
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
*************** InitProcess(void)
*** 401,407 ****
  	pg_atomic_init_u32(&MyProc->procArrayGroupNext, INVALID_PGPROCNO);
  
  	/* Check that group locking fields are in a proper initial state. */
- 	Assert(MyProc->lockGroupLeaderIdentifier == 0);
  	Assert(MyProc->lockGroupLeader == NULL);
  	Assert(dlist_is_empty(&MyProc->lockGroupMembers));
--- 401,406 ----
*************** InitAuxiliaryProcess(void)
*** 565,571 ****
  	SwitchToSharedLatch();
  
  	/* Check that group locking fields are in a proper initial state. */
- 	Assert(MyProc->lockGroupLeaderIdentifier == 0);
  	Assert(MyProc->lockGroupLeader == NULL);
  	Assert(dlist_is_empty(&MyProc->lockGroupMembers));
--- 564,569 ----
*************** ProcKill(int code, Datum arg)
*** 822,828 ****
  		dlist_delete(&MyProc->lockGroupLink);
  		if (dlist_is_empty(&leader->lockGroupMembers))
  		{
- 			leader->lockGroupLeaderIdentifier = 0;
  			leader->lockGroupLeader = NULL;
  			if (leader != MyProc)
  			{
--- 820,825 ----
*************** BecomeLockGroupLeader(void)
*** 1771,1777 ****
  	leader_lwlock = LockHashPartitionLockByProc(MyProc);
  	LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
  	MyProc->lockGroupLeader = MyProc;
- 	MyProc->lockGroupLeaderIdentifier = MyProcPid;
  	dlist_push_head(&MyProc->lockGroupMembers, &MyProc->lockGroupLink);
  	LWLockRelease(leader_lwlock);
  }
--- 1768,1773 ----
*************** BecomeLockGroupMember(PGPROC *leader, in
*** 1795,1800 ****
--- 1791,1799 ----
  	/* Group leader can't become member of group */
  	Assert(MyProc != leader);
  
+ 	/* Can't already be a member of a group */
+ 	Assert(MyProc->lockGroupLeader == NULL);
+ 
  	/* PID must be valid. */
  	Assert(pid != 0);
  
*************** BecomeLockGroupMember(PGPROC *leader, in
*** 1809,1815 ****
  	LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
  
  	/* Try to join the group */
! 	if (leader->lockGroupLeaderIdentifier == pid)
  	{
  		ok = true;
  		MyProc->lockGroupLeader = leader;
--- 1808,1814 ----
  	LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
  
  	/* Try to join the group */
! 	if (leader->pid == pid && leader->lockGroupLeader == leader)
  	{
  		ok = true;
  		MyProc->lockGroupLeader = leader;

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index a9405ce..dbcdd3f 100644
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
*************** struct PGPROC
*** 166,172 ****
  	 * Support for lock groups.  Use LockHashPartitionLockByProc on the group
  	 * leader to get the LWLock protecting these fields.
  	 */
- 	int			lockGroupLeaderIdentifier;	/* MyProcPid, if I'm a leader */
  	PGPROC	   *lockGroupLeader;	/* lock group leader, if I'm a member */
  	dlist_head	lockGroupMembers;	/* list of members, if I'm a leader */
  	dlist_node	lockGroupLink;	/* my member link, if I'm a member */
--- 166,171 ----