Обсуждение: raise ERROR between EndPrepare and PostPrepare_Locks causes ROLLBACK 2pc PAINC
Hi,
With the following hack in PrepareTransaction:
@@ -2697,6 +2697,8 @@ PrepareTransaction(void)
*/
EndPrepare(gxact);
+ elog(ERROR, "some error");
+
/*
* Now we clean up backend-internal state and release internal resources.
*/
Then we can do the following test:
CREATE TABLE t(a int);
begin;
insert into t values(1);
prepare transaction 'foo';
ERROR: some error. // before the lock transfer
rollback prepared 'foo';
PANIC: failed to re-find shared lock object // because lock is released
when ERROR.
One of the methods in my mind is to delay the END_CRIT_SECTION until
PostPrepare_Locks is finished, then the above case would become:
(1) ERROR
(2) Since CRIT_SECTION, error become to PAINC.
(3) client get PAINC and know the prepare failure.
(4). server does crash-recovery and the [2pc is recovered] (even client
has get error).
(5). We (as postgresql developer) should expected 2pc coordinator to run
"rollback prepared foo" anyway even the prepared failure and handle the
case where 'foo' doesn't exist.
I don't think it is a good solution because the mismatch between
(3) and (4). and the error handle in step (5).
This is not a real case to me, I just run into this with a
fault-injection testing framework. So after all is this something
which deserves a fix?
--
Best Regards
Andy Fan
Andy Fan <zhihuifan1213@163.com> writes: > Hi, > > With the following hack in PrepareTransaction: > > @@ -2697,6 +2697,8 @@ PrepareTransaction(void) > */ > EndPrepare(gxact); > > + elog(ERROR, "some error"); > + > /* > * Now we clean up backend-internal state and release internal resources. > */ > > Then we can do the following test: > CREATE TABLE t(a int); > begin; > insert into t values(1); > prepare transaction 'foo'; > ERROR: some error. // before the lock transfer > rollback prepared 'foo'; > PANIC: failed to re-find shared lock object // because lock is released > when ERROR. I found a similar but not exactly same case at 2014 [1] which might be helpful to recall a boarder understanding on this area. [1] https://www.postgresql.org/message-id/534AF601.1030007%40vmware.com -- Best Regards Andy Fan
Re: raise ERROR between EndPrepare and PostPrepare_Locks causes ROLLBACK 2pc PAINC
От
Michael Paquier
Дата:
On Wed, Mar 25, 2026 at 08:39:07AM +0800, Andy Fan wrote: > I found a similar but not exactly same case at 2014 [1] which > might be helpful to recall a boarder understanding on this area. > > [1] https://www.postgresql.org/message-id/534AF601.1030007%40vmware.com Incorrect shared state when an ERROR happens at an arbitrary location is usually bad, yes. For this one, your suggestion of delaying the end of the critical section started at StartPrepare() and ending in EndPrepare() is not an acceptable solution as far as I can see, unfortunately: it would mean doing a SyncRepWaitForLSN() while in a critical section, and I doubt we'd want to do that. Anyway, I doubt that this one is worth caring for. The current locking 2PC scheme means, as far as I remember, that it is not really possible to interact with an external command in a specific session between the EndPrepare() and the PostPrepare_Locks() calls. To put it in other words, let's imagine that we use a breakpoint between these two calls (or a wait injection point if you automate that). Is it possible for a second backend to mess with the state of the first backend waiting until its locks are transfered to the dummy PGPROC entry? That's what the 2014 thread is about: there was a race condition reachable between two sessions. If the answer to this question is yes, I'd agree that this is something that deserves a closer lookup. And before you ask: attempting to interact with a 2PC state from a second session with a first session waiting between these two points would not work: the 2PC entry is locked, cleaned up after EndPrepare() and PostPrepare_Locks() at PostPrepare_Twophase(). Trying to request an access to this entry fails, as the first backend is marked as locking it. A second backend attempting to lock it would fail, complaining that the 2PC entry with a GXID is "busy". SyncRepWaitForLSN() would be a problematic pattern between the EndPrepare() and the PostPrepare_Locks(), but we never ERROR there on purpose: even if we cancel while waiting for a transaction commit we'd just get a WARNING, meaning that we'd be able to transfer our locks anyway. Or perhaps you have a realistic scenario where it is possible to mess up with the shared state, outside a elog(ERROR) forced between these two points? -- Michael
Вложения
Michael Paquier <michael@paquier.xyz> writes: Hi, Thanks for showing interests for this topic! > On Wed, Mar 25, 2026 at 08:39:07AM +0800, Andy Fan wrote: >> I found a similar but not exactly same case at 2014 [1] which >> might be helpful to recall a boarder understanding on this area. >> >> [1] https://www.postgresql.org/message-id/534AF601.1030007%40vmware.com > > Incorrect shared state when an ERROR happens at an arbitrary location > is usually bad, yes. > > For this one, your suggestion of delaying the end of the critical > section started at StartPrepare() and ending in EndPrepare() is not an > acceptable solution as far as I can see, unfortunately: it would mean > doing a SyncRepWaitForLSN() while in a critical section, and I doubt > we'd want to do that. I do have a more general question: (Q1). what is critical section designed to? (Q2). What is the badness if we put more code into it except the ERROR->PANIC logic? Then (Q3) will SyncRepWaitForLSN be possible to raise ERROR? if no for now and possible in future, then my Q2 raise. > Anyway, I doubt that this one is worth caring for. The current locking > 2PC scheme means, as far as I remember, that it is not really possible > to interact with an external command in a specific session between > the EndPrepare() and the PostPrepare_Locks() > calls. Then Q3 comes. The deeper answer might be Q2 or Q1. > To put it in other words, let's imagine that we use a breakpoint > between these two calls (or a wait injection point if you automate > that). Is it possible for a second backend to mess with the state of > the first backend waiting until its locks are transfered to the dummy > PGPROC entry? That's what the 2014 thread is about: there was a race > condition reachable between two sessions. This is true, so the issue 2014 thread is more critical than the current one and which has been fixed. > If the answer to this question is yes, I'd agree that this is > something that deserves a closer lookup. Generally yes.. But I can't stop thinking the Q3 -> Q2 -> Q1 when I want to accept this asnwer. > And before you ask: attempting to interact with a 2PC > state from a second session with a first session waiting between these > two points would not work: the 2PC entry is locked, cleaned up after > EndPrepare() and PostPrepare_Locks() at PostPrepare_Twophase(). > Trying to request an access to this entry fails, as the first backend > is marked as locking it. A second backend attempting to lock it would > fail, complaining that the 2PC entry with a GXID is "busy". I can understand what you are saying now, but what does it make difference on the above case? > SyncRepWaitForLSN() would be a problematic pattern between the > EndPrepare() and the PostPrepare_Locks(), but we never ERROR there on > purpose: even if we cancel while waiting for a transaction commit we'd > just get a WARNING, meaning that we'd be able to transfer our locks > anyway. again Q2 -> Q3. > Or perhaps you have a realistic scenario where it is possible to mess > up with the shared state, outside a elog(ERROR) forced between these > two points? No really. I just inject some exception on some predefined places. I even don't know why people defined this places before. As for me, I prefer to know MORE design points for the CRITIAL SECTION besides what I side before. -- Best Regards Andy Fan