Discussion: PANIC: wrong buffer passed to visibilitymap_clear
Buildfarm members spurfowl[1] and thorntail[2] have each shown $SUBJECT once in the past two days. The circumstances are not quite the same; spurfowl's failure is in autovacuum while thorntail's is in a manual VACUUM command. Still, it seems clear that there's a recently-introduced bug here somewhere. I don't see any obvious candidate for the culprit, though.

Any ideas?

			regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=spurfowl&dt=2021-04-08%2010%3A22%3A08
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2021-04-09%2021%3A28%3A10
On Fri, Apr 9, 2021 at 3:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Buildfarm members spurfowl[1] and thorntail[2] have each shown $SUBJECT
> once in the past two days. The circumstances are not quite the same;
> spurfowl's failure is in autovacuum while thorntail's is in a manual
> VACUUM command. Still, it seems clear that there's a recently-introduced
> bug here somewhere. I don't see any obvious candidate for the culprit,
> though. Any ideas?

They're both VACUUM ANALYZE. They must be, because the calls to visibilitymap_clear PANIC (they don't ERROR) -- the failing visibilitymap_clear() call must occur inside a critical section, and all such calls are made within heapam.c (only VACUUM ANALYZE uses a transaction and does writes). It cannot be the two calls to visibilitymap_clear() inside vacuumlazy.c.

I suspect that you've figured this much already. Just pointing it out.

--
Peter Geoghegan
Hi,

On 2021-04-09 18:40:27 -0400, Tom Lane wrote:
> Buildfarm members spurfowl[1] and thorntail[2] have each shown $SUBJECT
> once in the past two days. The circumstances are not quite the same;
> spurfowl's failure is in autovacuum while thorntail's is in a manual
> VACUUM command. Still, it seems clear that there's a recently-introduced
> bug here somewhere. I don't see any obvious candidate for the culprit,
> though. Any ideas?

commit 7ab96cf6b312cfcd79cdc1a69c6bdb75de0ed30f
Author: Peter Geoghegan <pg@bowt.ie>
Date:   2021-04-06 07:49:39 -0700

    Refactor lazy_scan_heap() loop.

or some of the other changes in the vicinity could be related. There are some changes to when pages are marked AllVisible, when their free space is tracked, etc.

Just looking at the code in heap_update: I'm a bit confused about RelationGetBufferForTuple()'s vmbuffer and vmbuffer_other arguments. It looks like it's not at all clear which of the two arguments will have the vmbuffer for which of the pages?

		if (otherBuffer == InvalidBuffer || targetBlock <= otherBlock)
			GetVisibilityMapPins(relation, buffer, otherBuffer,
								 targetBlock, otherBlock, vmbuffer,
								 vmbuffer_other);
		else
			GetVisibilityMapPins(relation, otherBuffer, buffer,
								 otherBlock, targetBlock, vmbuffer_other,
								 vmbuffer);

Which then would make any subsequent use of vmbuffer vs vmbuffer_new in heap_update() bogus? Because clearly that code associates vmbuffer / vmbuffer_new with the respective page:

	/* clear PD_ALL_VISIBLE flags, reset all visibilitymap bits */
	if (PageIsAllVisible(BufferGetPage(buffer)))
	{
		all_visible_cleared = true;
		PageClearAllVisible(BufferGetPage(buffer));
		visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
							vmbuffer, VISIBILITYMAP_VALID_BITS);
	}
	if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
	{
		all_visible_cleared_new = true;
		PageClearAllVisible(BufferGetPage(newbuf));
		visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
							vmbuffer_new, VISIBILITYMAP_VALID_BITS);
	}

Greetings,

Andres Freund
Hi,

On 2021-04-09 16:27:12 -0700, Peter Geoghegan wrote:
> They're both VACUUM ANALYZE. They must be, because the calls to
> visibilitymap_clear PANIC (they don't ERROR) -- the failing
> visibilitymap_clear() call must occur inside a critical section, and
> all such calls are made within heapam.c (only VACUUM ANALYZE uses a
> transaction and does writes). It cannot be the two calls to
> visibilitymap_clear() inside vacuumlazy.c.

There's a stacktrace at the bottom of the spurfowl report:

======-=-====== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ======-=-======
[New LWP 24172]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: autovacuum worker   regression '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f77a7967428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
#0  0x00007f77a7967428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f77a796902a in __GI_abort () at abort.c:89
#2  0x000000000095cf8d in errfinish (filename=<optimized out>, filename@entry=0x9c3fa0 "visibilitymap.c", lineno=lineno@entry=155, funcname=funcname@entry=0x9c41c0 <__func__.13853> "visibilitymap_clear") at elog.c:680
#3  0x0000000000501498 in visibilitymap_clear (rel=rel@entry=0x7f77a96d2d28, heapBlk=<optimized out>, buf=buf@entry=0, flags=flags@entry=3 '\\003') at visibilitymap.c:155
#4  0x00000000004e6380 in heap_update (relation=relation@entry=0x7f77a96d2d28, otid=otid@entry=0x2c0394c, newtup=newtup@entry=0x2c03948, cid=0, crosscheck=crosscheck@entry=0x0, wait=wait@entry=true, tmfd=0x7ffe119d2c20, lockmode=0x7ffe119d2c1c) at heapam.c:3993
#5  0x00000000004e7d70 in simple_heap_update (relation=relation@entry=0x7f77a96d2d28, otid=otid@entry=0x2c0394c, tup=tup@entry=0x2c03948) at heapam.c:4211
#6  0x00000000005811a9 in CatalogTupleUpdate (heapRel=0x7f77a96d2d28, otid=0x2c0394c, tup=0x2c03948) at indexing.c:309
#7  0x00000000005efc32 in update_attstats (relid=16928, inh=inh@entry=false, natts=natts@entry=1, vacattrstats=vacattrstats@entry=0x2b3c030) at analyze.c:1746
#8  0x00000000005f264a in update_attstats (vacattrstats=0x2b3c030, natts=1, inh=false, relid=<optimized out>) at analyze.c:589
#9  do_analyze_rel (onerel=onerel@entry=0x7f77a95c1070, params=params@entry=0x2aba36c, va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>, relpages=33, inh=inh@entry=false, in_outer_xact=false, elevel=13) at analyze.c:589
#10 0x00000000005f2d8d in analyze_rel (relid=<optimized out>, relation=<optimized out>, params=params@entry=0x2aba36c, va_cols=0x0, in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:261
#11 0x0000000000671721 in vacuum (relations=0x2b492b8, params=params@entry=0x2aba36c, bstrategy=<optimized out>, bstrategy@entry=0x2aba4e8, isTopLevel=isTopLevel@entry=true) at vacuum.c:478
#12 0x000000000048f02d in autovacuum_do_vac_analyze (bstrategy=0x2aba4e8, tab=0x2aba368) at autovacuum.c:3316
#13 do_autovacuum () at autovacuum.c:2537
#14 0x0000000000779d76 in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1715
#15 0x0000000000779e79 in StartAutoVacWorker () at autovacuum.c:1500
#16 0x0000000000788324 in StartAutovacuumWorker () at postmaster.c:5539
#17 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5243
#18 <signal handler called>
#19 0x00007f77a7a2f5b3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#20 0x0000000000788668 in ServerLoop () at postmaster.c:1701
#21 0x000000000078a187 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x2a408c0) at postmaster.c:1409
#22 0x0000000000490e48 in main (argc=8, argv=0x2a408c0) at main.c:209
$1 = {si_signo = 6, si_errno = 0, si_code = -6, _sifields = {_pad = {24172, 1001, 0 <repeats 26 times>}, _kill = {si_pid = 24172, si_uid = 1001}, _timer = {si_tid = 24172, si_overrun = 1001, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 24172, si_uid = 1001, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 24172, si_uid = 1001, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x3e900005e6c, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band = 4299262287468, si_fd = 0}}}

Greetings,

Andres Freund
On 2021-04-09 16:27:39 -0700, Andres Freund wrote:
> Just looking at the code in heap_update: I'm a bit confused about
> RelationGetBufferForTuple()'s vmbuffer and vmbuffer_other
> arguments. It looks like it's not at all clear which of the two
> arguments will have the vmbuffer for which of the pages?
>
> 		if (otherBuffer == InvalidBuffer || targetBlock <= otherBlock)
> 			GetVisibilityMapPins(relation, buffer, otherBuffer,
> 								 targetBlock, otherBlock, vmbuffer,
> 								 vmbuffer_other);
> 		else
> 			GetVisibilityMapPins(relation, otherBuffer, buffer,
> 								 otherBlock, targetBlock, vmbuffer_other,
> 								 vmbuffer);

Oh, I missed that the arguments to GetVisibilityMapPins are appropriately swapped too.

Greetings,

Andres Freund
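[Editor's note: a toy sketch of the argument-swapping that resolved the confusion above. All names are illustrative stand-ins, not PostgreSQL's real code: the point is only that when the block arguments are swapped to get the lock ordering right, the vmbuffer output slots are swapped along with them, so each slot always ends up paired with its own page.]

```python
def get_visibility_map_pins(block1, block2, slot1, slot2, pins):
    # Stand-in for GetVisibilityMapPins(): processes block1 first (the
    # lower block number, for deadlock avoidance) and records which VM
    # page each output slot ends up holding.
    pins[slot1] = f"vm-page-for-block-{block1}"
    pins[slot2] = f"vm-page-for-block-{block2}"

def pin_vm_pages(target_block, other_block, pins):
    # Mirrors the if/else in RelationGetBufferForTuple quoted above:
    # the block arguments AND the vmbuffer slot arguments are swapped
    # together, so "vmbuffer" always ends up holding the VM page for
    # target_block, and "vmbuffer_other" the one for other_block.
    if other_block is None or target_block <= other_block:
        get_visibility_map_pins(target_block, other_block,
                                "vmbuffer", "vmbuffer_other", pins)
    else:
        get_visibility_map_pins(other_block, target_block,
                                "vmbuffer_other", "vmbuffer", pins)
```

Either branch leaves `pins["vmbuffer"]` holding the entry for `target_block`, which is why the subsequent uses of vmbuffer / vmbuffer_new in heap_update are not bogus after all.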
I've managed to reproduce this locally, by dint of running the src/bin/scripts tests over and over and tweaking the timing by trying different "taskset" parameters to vary the number of CPUs available. I find that I duplicated the report from spurfowl, particularly

(gdb) bt
#0  0x00007f67bb6807d5 in raise () from /lib64/libc.so.6
#1  0x00007f67bb669895 in abort () from /lib64/libc.so.6
#2  0x000000000094ce37 in errfinish (filename=<optimized out>,
    lineno=<optimized out>,
    funcname=0x9ac120 <__func__.1> "visibilitymap_clear") at elog.c:680
#3  0x0000000000488b8c in visibilitymap_clear (rel=rel@entry=0x7f67b2837330,
    heapBlk=<optimized out>, buf=buf@entry=0, flags=flags@entry=3 '\003')
                          ^^^^^^^^^^^^^^^
    at visibilitymap.c:155
#4  0x000000000055cd87 in heap_update (relation=0x7f67b2837330,
    otid=0x7f67b274744c, newtup=0x7f67b2747448, cid=<optimized out>,
    crosscheck=<optimized out>, wait=<optimized out>, tmfd=0x7ffecf4d5700,
    lockmode=0x7ffecf4d56fc) at heapam.c:3993
#5  0x000000000055dd61 in simple_heap_update (
    relation=relation@entry=0x7f67b2837330, otid=otid@entry=0x7f67b274744c,
    tup=tup@entry=0x7f67b2747448) at heapam.c:4211
#6  0x00000000005e531c in CatalogTupleUpdate (heapRel=0x7f67b2837330,
    otid=0x7f67b274744c, tup=0x7f67b2747448) at indexing.c:309
#7  0x00000000006420f9 in update_attstats (relid=1255, inh=false,
    natts=natts@entry=30, vacattrstats=vacattrstats@entry=0x19c9fc0)
    at analyze.c:1758
#8  0x00000000006430dd in update_attstats (vacattrstats=0x19c9fc0, natts=30,
    inh=false, relid=<optimized out>) at analyze.c:1646
#9  do_analyze_rel (onerel=<optimized out>, params=0x7ffecf4d5e50,
    va_cols=0x0, acquirefunc=<optimized out>, relpages=86,
    inh=<optimized out>, in_outer_xact=false, elevel=13) at analyze.c:589
#10 0x00000000006447a1 in analyze_rel (relid=<optimized out>,
    relation=<optimized out>, params=params@entry=0x7ffecf4d5e50,
    va_cols=0x0, in_outer_xact=<optimized out>, bstrategy=<optimized out>)
    at analyze.c:261
#11 0x00000000006a5718 in vacuum (relations=0x19c8158, params=0x7ffecf4d5e50,
    bstrategy=<optimized out>, isTopLevel=<optimized out>) at vacuum.c:478
#12 0x00000000006a5c94 in ExecVacuum (pstate=pstate@entry=0x1915970,
    vacstmt=vacstmt@entry=0x18ed5c8, isTopLevel=isTopLevel@entry=true)
    at vacuum.c:254
#13 0x000000000083c32c in standard_ProcessUtility (pstmt=0x18ed918,
    queryString=0x18eca20 "ANALYZE pg_catalog.pg_proc;",
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x18eda08, qc=0x7ffecf4d61c0) at utility.c:826

I'd not paid much attention to that point before, but now it seems there is no question that heap_update is reaching line 3993

		visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
							vmbuffer, VISIBILITYMAP_VALID_BITS);

without having changed "vmbuffer" from its initial value of InvalidBuffer. It looks that way both at frame 3 and frame 4:

(gdb) f 4
#4  0x000000000055cd87 in heap_update (relation=0x7f67b2837330,
    otid=0x7f67b274744c, newtup=0x7f67b2747448, cid=<optimized out>,
    crosscheck=<optimized out>, wait=<optimized out>, tmfd=0x7ffecf4d5700,
    lockmode=0x7ffecf4d56fc) at heapam.c:3993
3993			visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
(gdb) i locals
...
vmbuffer = 0
vmbuffer_new = 0
...

It is also hard to doubt that somebody broke this in the last-minute commit blizzard. There are no reports of this PANIC in the buildfarm for the last month, but we're now up to four (last I checked) since Thursday.

While the first thing that comes to mind is a logic bug in heap_update itself, that code doesn't seem to have changed much in the last few days. Moreover, why is it that this only seems to be happening within do_analyze_rel -> update_attstats? (We only have two stack traces positively showing that, but all the buildfarm reports look like the failure is happening within manual or auto analyze of a system catalog. Fishy as heck.)
Just eyeing the evidence on hand, I'm wondering if something has decided it can start setting the page-all-visible bit without adequate lock, perhaps only in system catalogs. heap_update is clearly assuming that that flag won't change underneath it, and if it did, it's clear how this symptom would ensue.

Too tired to take it further tonight ... discuss among yourselves.

			regards, tom lane
On Sat, Apr 10, 2021 at 10:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Just eyeing the evidence on hand, I'm wondering if something has decided
> it can start setting the page-all-visible bit without adequate lock,
> perhaps only in system catalogs. heap_update is clearly assuming that
> that flag won't change underneath it, and if it did, it's clear how this
> symptom would ensue.

Does this patch seem to fix the problem?

--
Peter Geoghegan
Attachments
Peter Geoghegan <pg@bowt.ie> writes:
> On Sat, Apr 10, 2021 at 10:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Just eyeing the evidence on hand, I'm wondering if something has decided
>> it can start setting the page-all-visible bit without adequate lock,
>> perhaps only in system catalogs.  heap_update is clearly assuming that
>> that flag won't change underneath it, and if it did, it's clear how this
>> symptom would ensue.

> Does this patch seem to fix the problem?

Hmm ... that looks pretty suspicious, I agree, but why wouldn't an exclusive buffer lock be enough to prevent concurrency with heap_update?

(I have zero faith in being able to show that this patch fixes the problem by testing, given how hard it is to reproduce. We need to convince ourselves that this is a fix by logic.)

			regards, tom lane
On Sun, Apr 11, 2021 at 8:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Does this patch seem to fix the problem?
>
> Hmm ... that looks pretty suspicious, I agree, but why wouldn't an
> exclusive buffer lock be enough to prevent concurrency with heap_update?

I don't have any reason to believe that using a super-exclusive lock during heap page vacuuming is necessary. My guess is that returning to doing it that way might make the buildfarm green again. That would at least confirm my suspicion that this code is relevant. The super-exclusive lock might have been masking the problem for a long time.

How about temporarily committing this patch, just to review how it affects the buildfarm?

--
Peter Geoghegan
On Sun, Apr 11, 2021 at 9:10 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I don't have any reason to believe that using a super-exclusive lock
> during heap page vacuuming is necessary. My guess is that returning to
> doing it that way might make the buildfarm green again. That would at
> least confirm my suspicion that this code is relevant. The
> super-exclusive lock might have been masking the problem for a long
> time.

This isn't just any super-exclusive lock, either -- we were calling ConditionalLockBufferForCleanup() at this point.

I now think that there is a good chance that we are seeing these symptoms because the "conditional-ness" went away -- we accidentally relied on that.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> This isn't just any super-exclusive lock, either -- we were calling
> ConditionalLockBufferForCleanup() at this point.

> I now think that there is a good chance that we are seeing these
> symptoms because the "conditional-ness" went away -- we accidentally
> relied on that.

Ah, I see it. In the fragment of heap_update where we have to do some TOAST work (starting at line 3815), we transiently *release our lock* on the old tuple's page. Unlike the earlier fragments that did that, this code path has no provision for rechecking whether the page has become all-visible, so if that does happen while we're without the lock then we lose. (It does look like RelationGetBufferForTuple knows about updating vmbuffer, but there's one code path through the if-nest at 3850ff that doesn't call that.)

So the previous coding in vacuumlazy didn't tickle this because it would only set the all-visible bit on a page it had superexclusive lock on; that is, continuing to hold the pin was sufficient. Nonetheless, if four out of five paths through heap_update take care of this matter, I'd say it's heap_update's bug, not vacuumlazy's, that the fifth path doesn't.

			regards, tom lane
I wrote:
> (It does look like RelationGetBufferForTuple
> knows about updating vmbuffer, but there's one code path through the
> if-nest at 3850ff that doesn't call that.)

Although ... isn't RelationGetBufferForTuple dropping the ball on this point too, in the code path at the end where it has to extend the relation?

I'm now inclined to think that we should toss every single line of that code, take RelationGetBufferForTuple out of the equation, and have just *one* place that rechecks for PageAllVisible having just become set. It's a rare enough case that optimizing it is completely not worth the code complexity and risk (er, reality) of hard-to-locate bugs.

			regards, tom lane
I wrote:
> I'm now inclined to think that we should toss every single line of that
> code, take RelationGetBufferForTuple out of the equation, and have just
> *one* place that rechecks for PageAllVisible having just become set.
> It's a rare enough case that optimizing it is completely not worth the
> code complexity and risk (er, reality) of hard-to-locate bugs.

Alternatively, we could do what you suggested and redefine things so that one is only allowed to set the all-visible bit while holding superexclusive lock; which again would allow an enormous simplification in heap_update and cohorts.

Either way, it's hard to argue that heap_update hasn't crossed the complexity threshold where it's impossible to maintain safely. We need to simplify it.

			regards, tom lane
On Sun, Apr 11, 2021 at 10:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alternatively, we could do what you suggested and redefine things
> so that one is only allowed to set the all-visible bit while holding
> superexclusive lock; which again would allow an enormous simplification
> in heap_update and cohorts.

Great detective work.

I would rather not go back to requiring a super-exclusive lock in vacuumlazy.c (outside of pruning), actually -- I was only pointing out that that had changed, and was likely to be relevant. It wasn't a real proposal. I think that it would be hard to justify requiring a super-exclusive lock just to call PageSetAllVisible(). PD_ALL_VISIBLE is fundamentally redundant information, so somehow it feels like the wrong design.

> Either way, it's hard to argue that
> heap_update hasn't crossed the complexity threshold where it's
> impossible to maintain safely.  We need to simplify it.

It is way too complicated. I don't think that I quite understand your first proposal right now, so I'll need to go think about it.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Sun, Apr 11, 2021 at 10:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Either way, it's hard to argue that
>> heap_update hasn't crossed the complexity threshold where it's
>> impossible to maintain safely.  We need to simplify it.

> It is way too complicated. I don't think that I quite understand your
> first proposal right now, so I'll need to go think about it.

It wasn't very clear, because I hadn't thought it through very much; but what I'm imagining is that we discard most of the thrashing around all-visible rechecks and have just one such test somewhere very late in heap_update, after we've successfully acquired a target buffer for the update and are no longer going to possibly need to release any buffer lock. If at that one point we see the page is all-visible and we don't have the vmbuffer, then we have to release all our locks and go back to "l2". Which is less efficient than some of the existing code paths, but given how hard this problem is to reproduce, it seems clear that optimizing for the occurrence is just not worth it.

			regards, tom lane
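[Editor's note: the "one late recheck, retry from l2" control flow proposed above can be sketched as a toy loop. Python, purely illustrative; none of these names exist in PostgreSQL, and everything interesting about locking is reduced to comments.]

```python
def update_with_late_recheck(page_all_visible, max_retries=10):
    # One all-visible check, performed late: after the target buffer is
    # (notionally) acquired and no further buffer-lock releases remain.
    vm_pinned = False
    for _ in range(max_retries):
        # ... lock old page, do any TOAST work, acquire target buffer ...
        if page_all_visible and not vm_pinned:
            # Missing the VM pin: release every lock, pin the VM page
            # with no lock held, and retry from scratch ("go back to l2").
            vm_pinned = True
            continue
        # Safe to proceed: either the page is not all-visible, or we now
        # hold the VM page pin needed to clear the flag.
        return "pinned" if vm_pinned else "no-pin-needed"
    raise RuntimeError("retry limit exceeded")
```

The retry path is less efficient than the specialized paths it would replace, but it runs only in the rare window where the flag flips, which is the trade-off being argued for.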
On Sun, Apr 11, 2021 at 11:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It wasn't very clear, because I hadn't thought it through very much;
> but what I'm imagining is that we discard most of the thrashing around
> all-visible rechecks and have just one such test somewhere very late
> in heap_update, after we've successfully acquired a target buffer for
> the update and are no longer going to possibly need to release any
> buffer lock.  If at that one point we see the page is all-visible
> and we don't have the vmbuffer, then we have to release all our locks
> and go back to "l2".  Which is less efficient than some of the existing
> code paths, but given how hard this problem is to reproduce, it seems
> clear that optimizing for the occurrence is just not worth it.

Oh! That sounds way better.

This reminds me of the tupgone case that I exorcised from vacuumlazy.c (in the same commit that stopped using a superexclusive lock). It was also described as an optimization that wasn't quite worth it. But I don't quite buy that. ISTM that there is a better explanation: it evolved the appearance of being an optimization that might make sense. Which was just camouflage.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Sun, Apr 11, 2021 at 11:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It wasn't very clear, because I hadn't thought it through very much;
>> but what I'm imagining is that we discard most of the thrashing around
>> all-visible rechecks and have just one such test somewhere very late
>> in heap_update, after we've successfully acquired a target buffer for
>> the update and are no longer going to possibly need to release any
>> buffer lock.  If at that one point we see the page is all-visible
>> and we don't have the vmbuffer, then we have to release all our locks
>> and go back to "l2".  Which is less efficient than some of the existing
>> code paths, but given how hard this problem is to reproduce, it seems
>> clear that optimizing for the occurrence is just not worth it.

> Oh! That sounds way better.

After poking at this for awhile, it seems like it won't work very nicely. The problem is that once we've invoked the toaster, we really don't want to just abandon that work; we'd leak any toasted out-of-line data that was created.

So I think we have to stick with the current basic design, and just tighten things up to make sure that visibility pins are accounted for in the places that are missing it.

Hence, I propose the attached. It passes check-world, but that proves absolutely nothing of course :-(. I wonder if there is any way to exercise these code paths deterministically.

(I have realized BTW that I was exceedingly fortunate to reproduce the buildfarm report here --- I have run hundreds of additional cycles of the same test scenario without getting a second failure.)

			regards, tom lane

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 03d4abc938..265b31e981 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3784,7 +3784,7 @@ l2:
 	 * overhead would be unchanged, that doesn't seem necessarily
 	 * worthwhile.
 	 */
-	if (PageIsAllVisible(BufferGetPage(buffer)) &&
+	if (PageIsAllVisible(page) &&
 		visibilitymap_clear(relation, block, vmbuffer,
 							VISIBILITYMAP_ALL_FROZEN))
 		cleared_all_frozen = true;
@@ -3846,36 +3846,46 @@ l2:
 		 * first".  To implement this, we must do RelationGetBufferForTuple
 		 * while not holding the lock on the old page, and we must rely on it
 		 * to get the locks on both pages in the correct order.
+		 *
+		 * Another consideration is that we need visibility map page pin(s) if
+		 * we will have to clear the all-visible flag on either page.  If we
+		 * call RelationGetBufferForTuple, we rely on it to acquire any such
+		 * pins; but if we don't, we have to handle that here.  Hence we need
+		 * a loop.
 		 */
-		if (newtupsize > pagefree)
-		{
-			/* Assume there's no chance to put heaptup on same page. */
-			newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
-											   buffer, 0, NULL,
-											   &vmbuffer_new, &vmbuffer);
-		}
-		else
+		for (;;)
 		{
+			if (newtupsize > pagefree)
+			{
+				/* It doesn't fit, must use RelationGetBufferForTuple. */
+				newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
+												   buffer, 0, NULL,
+												   &vmbuffer_new, &vmbuffer);
+				/* We're all done. */
+				break;
+			}
+			/* Acquire VM page pin if needed and we don't have it. */
+			if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+				visibilitymap_pin(relation, block, &vmbuffer);
 			/* Re-acquire the lock on the old tuple's page. */
 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 			/* Re-check using the up-to-date free space */
 			pagefree = PageGetHeapFreeSpace(page);
-			if (newtupsize > pagefree)
+			if (newtupsize > pagefree ||
+				(vmbuffer == InvalidBuffer && PageIsAllVisible(page)))
 			{
 				/*
-				 * Rats, it doesn't fit anymore.  We must now unlock and
-				 * relock to avoid deadlock.  Fortunately, this path should
-				 * seldom be taken.
+				 * Rats, it doesn't fit anymore, or somebody just now set the
+				 * all-visible flag.  We must now unlock and loop to avoid
+				 * deadlock.  Fortunately, this path should seldom be taken.
 				 */
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
-				newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
-												   buffer, 0, NULL,
-												   &vmbuffer_new, &vmbuffer);
 			}
 			else
 			{
-				/* OK, it fits here, so we're done. */
+				/* We're all done. */
 				newbuf = buffer;
+				break;
 			}
 		}
 	}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 37a1be4114..ffc89685bf 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -293,9 +293,13 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
 * happen if space is freed in that page after heap_update finds there's not
 * enough there).  In that case, the page will be pinned and locked only once.
 *
- * For the vmbuffer and vmbuffer_other arguments, we avoid deadlock by
- * locking them only after locking the corresponding heap page, and taking
- * no further lwlocks while they are locked.
+ * We also handle the possibility that the all-visible flag will need to be
+ * cleared on one or both pages.  If so, pin on the associated visibility map
+ * page must be acquired before acquiring buffer lock(s), to avoid possibly
+ * doing I/O while holding buffer locks.  The pins are passed back to the
+ * caller using the input-output arguments vmbuffer and vmbuffer_other.
+ * Note that in some cases the caller might have already acquired such pins,
+ * which is indicated by these arguments not being InvalidBuffer on entry.
 *
 * We normally use FSM to help us find free space.  However,
 * if HEAP_INSERT_SKIP_FSM is specified, we just append a new empty page to
@@ -666,6 +670,8 @@ loop:
 		if (otherBuffer != InvalidBuffer)
 		{
 			Assert(otherBuffer != buffer);
+			targetBlock = BufferGetBlockNumber(buffer);
+			Assert(targetBlock > otherBlock);
 
 			if (unlikely(!ConditionalLockBuffer(otherBuffer)))
 			{
@@ -674,10 +680,16 @@ loop:
 				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/*
-				 * Because the buffer was unlocked for a while, it's possible,
-				 * although unlikely, that the page was filled.  If so, just retry
-				 * from start.
+				 * Because the buffers were unlocked for a while, it's possible,
+				 * although unlikely, that an all-visible flag became set or that
+				 * somebody used up the available space in the new page.  We can
+				 * use GetVisibilityMapPins to deal with the first case.  In the
+				 * second case, just retry from start.
 				 */
+				GetVisibilityMapPins(relation, otherBuffer, buffer,
+									 otherBlock, targetBlock, vmbuffer_other,
+									 vmbuffer);
+
 				if (len > PageGetHeapFreeSpace(page))
 				{
 					LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
On Mon, Apr 12, 2021 at 9:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So I think we have to stick with the current basic design, and just
> tighten things up to make sure that visibility pins are accounted for
> in the places that are missing it.
>
> Hence, I propose the attached.  It passes check-world, but that proves
> absolutely nothing of course :-(.  I wonder if there is any way to
> exercise these code paths deterministically.

This approach seems reasonable to me. At least you've managed to structure the visibility map page pin check as concomitant with the existing space recheck.

> (I have realized BTW that I was exceedingly fortunate to reproduce
> the buildfarm report here --- I have run hundreds of additional
> cycles of the same test scenario without getting a second failure.)

In the past I've had luck with RR's chaos mode (most notably with the Jepsen SSI bug). That didn't work for me here, though I might just have not persisted with it for long enough. I should probably come up with a shell script that runs the same thing hundreds of times or more in chaos mode, while making sure that useless recordings don't accumulate. The feature is described here:

https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html

You only have to be lucky once. Once that happens, you're left with a recording to review and re-review at your leisure. This includes all Postgres backends, maybe even pg_regress and other scaffolding (if that's what you're after).

But that's for debugging, not testing. The only way that we'll ever be able to test stuff like this is with something like Alexander Korotkov's stop events patch [1]. That infrastructure should be added sooner rather than later.

[1] https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com

--
Peter Geoghegan
Hi,

On 2021-04-11 13:55:30 -0400, Tom Lane wrote:
> Either way, it's hard to argue that heap_update hasn't crossed the
> complexity threshold where it's impossible to maintain safely.  We
> need to simplify it.

Yea, I think we're well beyond that point.

I can see a few possible steps to wrangle the existing complexity into an easier to understand shape:

- Rename the heapam.c goto labels; they're useless for understanding what is happening.

- Move the HeapTupleSatisfiesUpdate() call and the related branches afterwards into their own function.

- Move the "temporarily mark it locked" branch into its own function. It's a minimal implementation of tuple locking, so it seems more than separate enough.

- Move the "store the new tuple" part into its own function (pretty much the critical section).

- Probably worth unifying the exit paths -- there's a fair bit of duplication by now...

Half related:

- I think we might also need to do something about the proliferation of bitmaps in heap_update(). We now separately allocate 5 bitmapsets -- that strikes me as fairly insane.

However, these would not really address the complexity in itself, just make it easier to manage. ISTM that a lot of the complexity is related to needing to retry (and avoiding doing so unnecessarily), which in turn is related to avoiding deadlocks.

We actually know how to avoid needing that to the same degree: the (need_toast || newtupsize > pagefree) path locks the tuple first, and afterwards has a lot more freedom. We obviously can't just always do that, due to the WAL logging overhead.

I wonder if we could make that path avoid the WAL logging overhead. We don't actually need a full-blown tuple lock, potentially even with its own multixact, here. The relevant comment (in heap_lock_tuple()) says:

	/*
	 * XLOG stuff.  You might think that we don't need an XLOG record because
	 * there is no state change worth restoring after a crash.  You would be
	 * wrong however: we have just written either a TransactionId or a
	 * MultiXactId that may never have been seen on disk before, and we need
	 * to make sure that there are XLOG entries covering those ID numbers.
	 * Else the same IDs might be re-used after a crash, which would be
	 * disastrous if this page made it to disk before the crash.  Essentially
	 * we have to enforce the WAL log-before-data rule even in this case.
	 * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
	 * entries for everything anyway.)
	 */

But I don't really think that full-blown tuple-locking WAL logging is the right solution:

- The "next xid" concerns are at least as easily solved by WAL logging a distinct "highest xid assigned" record when necessary. Either by having a shared memory variable saying "latestLoggedXid" or such, or by having end-of-recovery advance nextXid to beyond what ExtendCLOG() extended to. That reduces the overhead to at most once per xact (and commonly less), or to nothing, respectively.

- While there's obviously a good bit of simplicity in ensuring that a replica is exactly the same ("Also, in a PITR log-shipping or 2PC environment ..."), we don't actually enforce that strictly anyway -- so I am not sure why it's necessary to pay the price here.

But maybe I'm all wet here; I certainly haven't had enough coffee.

Greetings,

Andres Freund
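[Editor's note: the "highest xid assigned" idea above can be illustrated with a toy recovery calculation. Python, purely illustrative; "latestLoggedXid" is Andres's hypothetical record, not an existing PostgreSQL symbol. The invariant being protected is simply that recovery must move nextXid past every xid the WAL covers, so an xid stamped on a page that reached disk before a crash can never be handed out a second time.]

```python
def next_xid_after_recovery(next_xid_at_checkpoint, logged_xids):
    # logged_xids: every xid appearing in some WAL record, including the
    # hypothetical "highest xid assigned" record that would be emitted
    # before an otherwise-unlogged tuple lock stamps that xid on a page.
    highest_logged = max(logged_xids, default=0)
    # Resume xid assignment strictly past anything the WAL has covered.
    return max(next_xid_at_checkpoint, highest_logged + 1)
```

For example, if a tuple lock stamps xid 105 onto a page that makes it to disk, and a "highest xid assigned" record covering 105 was logged first, recovery resumes at 106 rather than re-dispensing 105; without any such record, recovery could restart at the checkpoint's value and reuse 105, which is the disaster the heap_lock_tuple() comment warns about.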
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Apr 12, 2021 at 9:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hence, I propose the attached.  It passes check-world, but that proves
>> absolutely nothing of course :-(.  I wonder if there is any way to
>> exercise these code paths deterministically.

> This approach seems reasonable to me. At least you've managed to
> structure the visibility map page pin check as concomitant with the
> existing space recheck.

Thanks for looking it over. Do you have an opinion on whether or not to back-patch? As far as we know, these bugs aren't exposed in the back branches for lack of code that would set the all-visible flag without superexclusive lock. But I'd still say that heap_update is failing to honor its API contract in these edge cases, and that seems like something that could bite us after future back-patches. Or there might be third-party code that can set all-visible flags. So I'm kind of tempted to back-patch, but it's a judgment call.

			regards, tom lane
On Mon, Apr 12, 2021 at 6:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thanks for looking it over.  Do you have an opinion on whether or not
> to back-patch?  As far as we know, these bugs aren't exposed in the
> back branches for lack of code that would set the all-visible flag
> without superexclusive lock.  But I'd still say that heap_update
> is failing to honor its API contract in these edge cases, and that
> seems like something that could bite us after future back-patches.

If we assume that a scenario like the one we've been debugging can never happen in the backbranches, then we must also assume that your fix has negligible risk in the backbranches, because of how it is structured. And so I agree -- I lean towards backpatching.

--
Peter Geoghegan