Discussion: Poorly thought out code in vacuum
Lately I have noticed buildfarm member jaguar occasionally failing regression tests with ERROR: invalid memory alloc request size 1073741824 during a VACUUM, as for example at http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2012-01-04%2023%3A05%3A02

Naturally I supposed it to be a consequence of the CLOBBER_CACHE_ALWAYS option, ie, somebody accessing an already-freed cache entry. I managed to duplicate and debug it here after an afternoon of trying, and it's not actually related to that at all, except perhaps to the extent that CLOBBER_CACHE_ALWAYS slows down some things enormously more than others. The failure comes when a ResourceOwner tries to enlarge its pinned-buffers array to more than 1GB, and that's not a typo: there are umpteen entries for the same buffer, whose PrivateRefCount is 134217727 and trying to go higher.

A stack trace soon revealed the culprit, which is this change in lazy_vacuum_heap():

@@ -932,7 +965,8 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 		tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
 								 vac_strategy);
-		LockBufferForCleanup(buf);
+		if (!ConditionalLockBufferForCleanup(buf))
+			continue;
 		tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats);

 		/* Now that we've compacted the page, record its available space */

introduced in http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=bbb6e559c4ea0fb4c346beda76736451dc24eb4e

The "continue" just results in another iteration of the containing tight loop, and of course that immediately re-pins the same buffer without having un-pinned it. So if someone else sits on their pin of the buffer for long enough, kaboom.
We could fix the direct symptom by inserting UnlockReleaseBuffer() in front of the continue, but AFAICS that just makes this into a busy-waiting equivalent of waiting unconditionally, so I don't really see why we should not revert the above fragment altogether. However, I suppose Robert had something more intelligent in mind than a tight loop when the buffer can't be exclusively locked, so maybe there is some other change that should be made here instead.

			regards, tom lane
On Fri, Jan 6, 2012 at 12:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> We could fix the direct symptom by inserting UnlockReleaseBuffer()
> in front of the continue, but AFAICS that just makes this into a
> busy-waiting equivalent of waiting unconditionally, so I don't really
> see why we should not revert the above fragment altogether. However,
> I suppose Robert had something more intelligent in mind than a tight
> loop when the buffer can't be exclusively locked, so maybe there is
> some other change that should be made here instead.

Skipping the block completely isn't feasible. The only options are to wait or to do out-of-order cleaning.

The conditional behaviour in vacuum_scan_heap() is much more important than it is here, so just waiting for the lock wouldn't be too bad. Out-of-order cleaning could be very expensive in terms of I/O and could make it less robust. So I'd say let's just wait normally for the lock.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 5, 2012 at 7:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I suppose Robert had something more intelligent in mind than a tight
> loop when the buffer can't be exclusively locked, so maybe there is
> some other change that should be made here instead.

My intention was to skip the tuple, but I failed to notice the unusual way in which this loop iterates. How about something like the attached?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Fri, Jan 6, 2012 at 2:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 5, 2012 at 7:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I suppose Robert had something more intelligent in mind than a tight
>> loop when the buffer can't be exclusively locked, so maybe there is
>> some other change that should be made here instead.
>
> My intention was to skip the tuple, but I failed to notice the unusual
> way in which this loop iterates. How about something like the
> attached?

It solves the waiting issue, but leaves unused tuples in the heap that previously would have been removed.

I don't think that is a solution.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 6, 2012 at 9:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, Jan 6, 2012 at 2:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> My intention was to skip the tuple, but I failed to notice the unusual
>> way in which this loop iterates. How about something like the
>> attached?
>
> It solves the waiting issue, but leaves unused tuples in the heap that
> previously would have been removed.
>
> I don't think that is a solution.

Uh, we discussed this before the patch was committed, and you agreed it made sense to do that in the second heap scan just as we do it in the first heap scan. If you now see a problem with that, that's fine, but please explain what the problem is, rather than just saying it's not acceptable for the patch to do the thing that the patch was designed to do.

Actually, on further review, I do see another problem: we need to pass the scan_all flag down to lazy_vacuum_heap, and only do this conditionally if that flag is not set. Otherwise we might skip a page but still advance relfrozenxid, which would definitely be bad. But that looks easy to fix, and I don't see any other reason why this would be either unsafe or undesirable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 6, 2012 at 3:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Uh, we discussed this before the patch was committed, and you agreed
> it made sense to do that in the second heap scan just as we do it in
> the first heap scan. If you now see a problem with that, that's fine,
> but please explain what the problem is, rather than just saying it's
> not acceptable for the patch to do the thing that the patch was
> designed to do.

My name is on that patch, so of course I accept responsibility for the earlier decision making.

I *have* explained what is wrong. It "leaves unused tuples in the heap that would previously have been removed".

More simply: lazy_vacuum_page() does some work and we can't skip that work just because we don't get the lock. Elsewhere in the patch we skipped getting the lock when there was no work to be done. In lazy_vacuum_heap() we only visit pages that need further work, hence every page is important.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 6, 2012 at 10:51 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I *have* explained what is wrong. It "leaves unused tuples in the heap
> that would previously have been removed".
>
> More simply: lazy_vacuum_page() does some work and we can't skip that
> work just because we don't get the lock. Elsewhere in the patch we
> skipped getting the lock when there was no work to be done.

That is not true. We skip vacuumable pages in the first heap pass even if they contain dead tuples, unless scan_all is set or we can get the buffer lock without waiting.

> In lazy_vacuum_heap() we only visit pages that need further work, hence
> every page is important.

If the system were to crash after the first heap pass and the index vacuuming, but before the second heap pass, nothing bad would happen. The next vacuum would collect those same dead tuples, scan the indexes for them (and not find them, since we already removed them), and then remove them from the heap.

We already made a decision that it's a good idea to skip vacuuming a page on which we can't immediately obtain a cleanup lock, because waiting for the cleanup lock can cause the autovacuum worker to get stuck indefinitely, starving the table and in some cases the entire cluster of adequate vacuuming. That logic is just as valid in the second heap pass as it is in the first one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I started to wonder how likely it would be that some other process would sit on a buffer pin for so long as to allow 134217727 iterations of ReadBufferExtended, even given the slowdowns associated with CLOBBER_CACHE_ALWAYS. This led to some fruitless searching for possible deadlock conditions, but eventually I realized that there's a much simpler explanation: if PrivateRefCount > 1 then ConditionalLockBufferForCleanup always fails.

This means that once ConditionalLockBufferForCleanup has failed once, the currently committed code in lazy_vacuum_heap is guaranteed to loop until it gets a failure in enlarging the ResourceOwner buffer-reference array. Which in turn means that neither of you ever actually exercised the skip case, or you'd have noticed the problem. Nor are the current regression tests well designed to exercise the case. (There might well be failures of this type happening from time to time in autovacuum, but we'd not see any reported failure in the buildfarm, unless we went trawling in postmaster logs.)

So at this point I've got serious doubts as to the quality of testing of that whole patch, not just this part.

			regards, tom lane
On Fri, Jan 6, 2012 at 12:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I started to wonder how likely it would be that some other process would
> sit on a buffer pin for so long as to allow 134217727 iterations of
> ReadBufferExtended, even given the slowdowns associated with
> CLOBBER_CACHE_ALWAYS. This led to some fruitless searching for possible
> deadlock conditions, but eventually I realized that there's a much
> simpler explanation: if PrivateRefCount > 1 then
> ConditionalLockBufferForCleanup always fails. This means that once
> ConditionalLockBufferForCleanup has failed once, the currently committed
> code in lazy_vacuum_heap is guaranteed to loop until it gets a failure
> in enlarging the ResourceOwner buffer-reference array. Which in turn
> means that neither of you ever actually exercised the skip case, or
> you'd have noticed the problem. Nor are the current regression tests
> well designed to exercise the case. (There might well be failures of
> this type happening from time to time in autovacuum, but we'd not see
> any reported failure in the buildfarm, unless we went trawling in
> postmaster logs.)
>
> So at this point I've got serious doubts as to the quality of testing of
> that whole patch, not just this part.

I tested the case where we skip a block during the first pass, but I admit that I punted on testing the case where we skip a block during the second pass, because I couldn't think of a good way to exercise it. Any suggestions?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jan 6, 2012 at 12:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So at this point I've got serious doubts as to the quality of testing of
>> that whole patch, not just this part.

> I tested the case where we skip a block during the first pass, but I
> admit that I punted on testing the case where we skip a block during
> the second pass, because I couldn't think of a good way to exercise
> it. Any suggestions?

Hack ConditionalLockBufferForCleanup to have a 50% probability of failure regardless of anything else, for instance via

	static int ctr = 0;

	if ((++ctr) % 2)
		return false;

			regards, tom lane
On Fri, Jan 6, 2012 at 12:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I tested the case where we skip a block during the first pass, but I
>> admit that I punted on testing the case where we skip a block during
>> the second pass, because I couldn't think of a good way to exercise
>> it. Any suggestions?
>
> Hack ConditionalLockBufferForCleanup to have a 50% probability of
> failure regardless of anything else, for instance via
>
>         static int ctr = 0;
>
>         if ((++ctr) % 2)
>                 return false;

Oh, that's brilliant. OK, I'll go try that.

Note to self: Try to remember to take that hack back out before committing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 6, 2012 at 12:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Oh, that's brilliant. OK, I'll go try that.

All right, that test does in fact reveal the broken-ness of the current code, and the patch I posted upthread does seem to fix it, so I've committed that.

After further reflection, I'm thinking that skipping the *second* heap pass when a cleanup lock is not immediately available is safe even during an anti-wraparound vacuum (i.e. scan_all = true), because the only thing the second pass is doing is changing dead line pointers to unused, and failing to do that doesn't prevent relfrozenxid advancement: the dead line pointer doesn't contain an XID. The next vacuum will clean it up: it'll see that there's a dead line pointer, add that to the list of TIDs to be killed if seen during the index vacuum (it won't be, since we already removed them the first time, but vacuum doesn't know or care), and then try again during the second heap pass to mark that dead line pointer unused. This might be worth a code comment, since it's not entirely obvious.

At any rate, it seems that the worst thing that happens from skipping a block during the second pass is that we postpone reclaiming a line pointer whose tuple storage has already been recovered, which is pretty far down in the weeds.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company