Обсуждение: SKIP LOCKED assert triggered
The combination of these two statements in a transaction hits an Assert in heapam.c at line 4770 on REL_14_STABLE BEGIN; SELECT * FROM queue LIMIT 1 FOR UPDATE SKIP LOCKED; ... UPDATE queue SET status = 'UPDATED' WHERE id = :id; COMMIT; pgbench reliably finds this, running from inside a PL/pgSQL function. Alvaro suggests we just ignore the Assert, which on testing appears to be the right approach. Patch attached, which solves it for me. There is no formal test that does lock then update, so I have attempted to create one, but this has not successfully reproduced it (attached anyway), but this code is different from the pgbench test. -- Simon Riggs http://www.EnterpriseDB.com/
Вложения
On 11/12/21, 8:56 AM, "Simon Riggs" <simon.riggs@enterprisedb.com> wrote: > The combination of these two statements in a transaction hits an > Assert in heapam.c at line 4770 on REL_14_STABLE I've been unable to reproduce this. Do you have any tips for how to do so? Does there need to be some sort of concurrent workload? Nathan
On Wed, 1 Dec 2021 at 14:33, Bossart, Nathan <bossartn@amazon.com> wrote: > > On 11/12/21, 8:56 AM, "Simon Riggs" <simon.riggs@enterprisedb.com> wrote: > > The combination of these two statements in a transaction hits an > > Assert in heapam.c at line 4770 on REL_14_STABLE > > I've been unable to reproduce this. Do you have any tips for how to > do so? Does there need to be some sort of concurrent workload? That path is only ever taken when there are multiple sessions, and as I said, pgbench finds this reliably. I guess I didn't say "use -c 2" -- Simon Riggs http://www.EnterpriseDB.com/
On 2021-Dec-01, Simon Riggs wrote: > On Wed, 1 Dec 2021 at 14:33, Bossart, Nathan <bossartn@amazon.com> wrote: > > > > On 11/12/21, 8:56 AM, "Simon Riggs" <simon.riggs@enterprisedb.com> wrote: > > > The combination of these two statements in a transaction hits an > > > Assert in heapam.c at line 4770 on REL_14_STABLE > > > > I've been unable to reproduce this. Do you have any tips for how to > > do so? Does there need to be some sort of concurrent workload? > > That path is only ever taken when there are multiple sessions, and as > I said, pgbench finds this reliably. I guess I didn't say "use -c 2" Simon had sent me the pgbench scripts earlier, so I attach them here. I don't actually get a crash with -c2 or -c3, but I do get almost immediate crashes with -c4 and above. If I run it under "rr", it doesn't occur either. I suspect the rr framework kills concurrency in some way that hides the problem. I didn't find a way to reproduce it with isolationtester. (If somebody wants to play with a debugger, I find that it's much easier to reproduce by adding a short sleep after the UpdateXmaxHintBits() call in line 4735; but that sleep occurs in a session *other* than the one that dies. And under rr I still don't see a crash with a sleep there; in fact the sleep doesn't seem to occur at all, which is weird.) The patch does fix the crasher under pgbench, and I think it makes sense that you can get WouldBlock and yet have the tuple marked with XMAX_INVALID: if transaction A is writing the tuple, and transaction B is acquiring the tuple lock, then transaction C also tries to acquire the tuple lock but that returns nay (because of B), then transaction A completes, then transaction B could set the XMAX_INVALID flag in time for C to have a seizure in its way out. So patching the assertion to allow the case is correct. What I don't understand is why hasn't this been reported already: this bug is pretty old. My only explanation is that nobody runs sufficiently- concurrent load with SKIP LOCKED in assert-enabled builds. [1] https://www.postgresql.org/message-id/flat/CADLWmXUvd5Z%2BcFczi6Zj1WcTrXzipgP-wj0pZOWSaRUy%3DF0omQ%40mail.gmail.com -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Вложения
On 2022-Jan-03, Alvaro Herrera wrote: > What I don't understand is why hasn't this been reported already: this > bug is pretty old. My only explanation is that nobody runs sufficiently- > concurrent load with SKIP LOCKED in assert-enabled builds. Pushed, thanks Simon for reporting this problem. > [1] https://www.postgresql.org/message-id/flat/CADLWmXUvd5Z%2BcFczi6Zj1WcTrXzipgP-wj0pZOWSaRUy%3DF0omQ%40mail.gmail.com Heh, I deleted a paragraph from my previous email and forgot to remove the footnote that it referenced. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "I love the Postgres community. It's all about doing things _properly_. :-)" (David Garamond)
On 2022-Jan-04, Alvaro Herrera wrote: > On 2022-Jan-03, Alvaro Herrera wrote: > > > What I don't understand is why hasn't this been reported already: this > > bug is pretty old. My only explanation is that nobody runs sufficiently- > > concurrent load with SKIP LOCKED in assert-enabled builds. > > Pushed, thanks Simon for reporting this problem. Wow, what an embarrassing problem this fix has. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "Pensar que el espectro que vemos es ilusorio no lo despoja de espanto, sólo le suma el nuevo terror de la locura" (Perelandra, C.S. Lewis)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Pushed, thanks Simon for reporting this problem. Umm ... Assert(TM_WouldBlock || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID)); AFAICS, this assertion condition is constant-true, because TM_WouldBlock is a nonzero constant. Perhaps you meant Assert(result == TM_WouldBlock || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID)); ? I'd be inclined to format it more like the adjacent Assert, too. regards, tom lane
On Tue, 4 Jan 2022 at 16:15, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > Pushed, thanks Simon for reporting this problem. And causing another; my bad, apologies. > Umm ... > > Assert(TM_WouldBlock || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID)); > > AFAICS, this assertion condition is constant-true, > because TM_WouldBlock is a nonzero constant. Perhaps you meant > > Assert(result == TM_WouldBlock || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID)); Yes, I think that's what I meant. -- Simon Riggs http://www.EnterpriseDB.com/