Re: BUG #16811: Severe reproducible server backend crash
От | Tom Lane |
---|---|
Тема | Re: BUG #16811: Severe reproducible server backend crash |
Дата | |
Msg-id | 3965163.1612382840@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Re: BUG #16811: Severe reproducible server backend crash (Tom Lane <tgl@sss.pgh.pa.us>) |
Список | pgsql-bugs |
I wrote: > I can reproduce it without anything extra. What's needed is to run > the problematic statement in extended query mode, which you can > do like this: > $ cat foo.sql > do $$ begin rollback; end $$; > $ pgbench -n -f foo.sql -M prepared > pgbench: error: client 0 aborted in command 0 (SQL) of script 0; perhaps the backend died while processing I spent some more time poking at this today. Thomas' intuition that commit 1cff1b95a is related was right: if you bisect with this test case, it'll finger that as the first bad commit. But I think that's blaming the messenger. What's happening is that in extended-query mode, the Portal containing the DO operation does not have its own copy of the parser-output statement list; portal->stmts is pointing at a list belonging to a CachedPlan. And if the DO decides to roll back the current transaction, we get to PortalReleaseCachedPlan which does /* * We must also clear portal->stmts which is now a dangling reference * to the cached plan's plan list. This protects any code that might * try to examine the Portal later. */ portal->stmts = NIL; "Examining the portal later" is not the problem though. Way back up the call stack is PortalRunMulti, which is iterating over that same statement list, and is now doing so with a dangling pointer. So the crash we're seeing occurs because portal->stmts no longer matches the list pointer that PortalRunMulti has. But the dangling-pointer hazard existed before that. There are two reasons that explain why we don't crash pre-v13: * A transaction abort that didn't cause longjmp out of PortalRunMulti could only have happened inside a CALL or DO block. Those are utility statements, so there is no way for them to be part of a multi-statement list. Therefore there's no hazard of PortalRunMulti trying to execute additional statements within the same portal (which I doubt would work if it did try). * Again because this must be a utility command, the CachedPlan we are using must be a generic one, therefore its CachedPlanSource is holding a refcount on it. Thus, when PortalReleaseCachedPlan releases the Portal's refcount, the CachedPlan's refcount does not go to zero and so it doesn't get freed, and thus the "dangling" list pointer can still be dereferenced to discover that indeed we are at the end of the list. This all seems like a house of cards: it really gives me the willies that we're allowing AtAbort_Portals to release resources belonging to an active portal. However, I don't see any near-term way around that. If it doesn't release resources then the transaction abort mechanisms will complain (and release the resources anyway). So I think that Thomas' "cargo cult" patch is on the right track: what we have to do is make PortalRunMulti robust against the possibility that the portal was zapped underneath it. But I feel that we ought to prevent it from trying to iterate the foreach() as well as not doing the lnext(), so I'm intending to add something like if (portal->stmts == NIL) break; rather than only skipping the lnext(). We can reorder the loop steps to do the lnext() at the bottom, and then this'll work easily. Also, even though we don't see this crash before v13, I want to back-patch further. Accessing a cache entry we no longer have any refcount for is just asking for trouble, even if it accidentally fails to fail right now. regards, tom lane
В списке pgsql-bugs по дате отправления:
Предыдущее
От: PG Bug reporting formДата:
Сообщение: BUG #16852: Latest postgres:10-alpine Docker image error
Следующее
От: PG Bug reporting formДата:
Сообщение: BUG #16853: Materialized view not behaving in fully MVCC-compliant way