Обсуждение: Re: Always show correct error message for statement timeouts, fixes random buildfarm failures
Re: Always show correct error message for statement timeouts, fixes random buildfarm failures
От
Jelte Fennema-Nio
Дата:
On Tue, 18 Feb 2025 at 10:29, Jelte Fennema-Nio <me@jeltef.nl> wrote: > > So I think the solution in 0002, while debatable if strictly correct, > > is the only fix that we can easily do. Also I personally believe the > > behaviour resulting from 0002 is totally correct: The new behaviour > > would be that if a timeout occurred, right before it was disabled or > > reset, but the interrupt was not processed yet, then we process that > > timeout as normal. That seems totally reasonable behaviour to me from > > the perspective of an end user: You get a timeout error when the > > timeout occurred before the timeout was disabled/reset. I still think there's no reason for this not to be committed, as it fixes both a bug and a flaky test. On Tue, 18 Feb 2025 at 10:29, Jelte Fennema-Nio <me@jeltef.nl> wrote: > > (was part of "Add non-blocking version of PQcancel" thread[1]) > > The commitfest app notified me that this patch needed a rebase. While > doing that I'm also moving this patch out of the original thread it > was part of. It fixes a bug that causes this buildfarm failure[2]. In > rare cases the current code would report "canceling statement due to > user request", even though it was actually canceled due to a statement > timeout, so it should have this error message: "canceling statement > due to statement timeout". A way to reproduce this problem locally was > provided by Alexander[3]. > > I took it out of the original thread because that has become > impossible to follow (at least for me, so I expect also for others) > due to discussing 3 different random failures involving cancel > requests. And this specific one doesn't even have anything to do with > the original non-blocking PQCancel work. > > I have inlined the parts of emails that are relevant to this patch > below. That way others don't have to dissect that thread. (for context > this patch is often, but not always, called patch 0002 in the original > thread). > > My initial email containing the patch[4]: > > On Mon, 24 Jun 2024 at 00:59, Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > > On Sat, 22 Jun 2024 at 17:00, Alexander Lakhin <exclusion@gmail.com> wrote: > > > I also came across another failure of the test: > > > @@ -2774,7 +2774,7 @@ > > > BEGIN; > > > SET LOCAL statement_timeout = '10ms'; > > > select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long > > > -ERROR: canceling statement due to statement timeout > > > +ERROR: canceling statement due to user request > > > COMMIT; > > > > > > which is reproduced with a sleep added here: > > > @@ -1065,6 +1065,7 @@ exec_simple_query(const char *query_string) > > > */ > > > parsetree_list = pg_parse_query(query_string); > > > +pg_usleep(11000); > > > > After investigating, I realized this actually exposes a bug in our > > statement timeout logic. It has nothing to do with posgres_fdw and > > reproduces with any regular postgres query too. Attached is a patch > > that fixes this issue. This one should probably be backported. > > > Tom's review[5]: > > On Fri, 30 Aug 2024 at 22:12, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I think 0002 is probably outright wrong, or at least the change to > > disable_statement_timeout is. Once we get to that, we don't want > > to throw a timeout error any more, even if an interrupt was received > > just before it. > > My initial response to that review[6]: > > On Fri, 30 Aug 2024 at 23:24, Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > > The disable_statement_timeout change was not the part of that patch > > that was necessary for stable output, only the change in the first > > branch of enable_statement_timeout was necessary. The reason being > > that enable_statement_timeout is called multiple times for a query, > > because start_xact_command is called multiple times in > > exec_simple_query. The change to disable_statement_timeout just seemed > > like the logical extension of that change, especially since there was > > basically a verbatim copy of disable_statement_timeout in the second > > branch of enable_statement_timeout. > > > > To make sure I understand your suggestion correctly: Are you saying > > you would want to completely remove the outstanding interrupt if it > > was caused by de statement_timout when disable_statement_timeout is > > called? Because I agree that would probably make sense, but that > > sounds like a more impactful change. But the current behaviour seems > > strictly worse than the behaviour proposed in the patch to me, because > > currently the backend would still be interrupted, but the error would > > indicate a reason for the interrupt that is simply incorrect i.e. it > > will say it was cancelled due to a user request, which never happened. > > My second response to that review[7] > > On Sat, 31 Aug 2024 at 09:08, Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > > I also took a quick look at the code again, and completely removing > > the outstanding interrupt seems hard to do. Because there's no way to > > know if there were multiple causes for the interupt, i.e. someone > > could have pressed ctrl+c as well and we wouldn't want to undo that. > > > > So I think the solution in 0002, while debatable if strictly correct, > > is the only fix that we can easily do. Also I personally believe the > > behaviour resulting from 0002 is totally correct: The new behaviour > > would be that if a timeout occurred, right before it was disabled or > > reset, but the interrupt was not processed yet, then we process that > > timeout as normal. That seems totally reasonable behaviour to me from > > the perspective of an end user: You get a timeout error when the > > timeout occurred before the timeout was disabled/reset. > > > [1]: https://www.postgresql.org/message-id/flat/AM5PR83MB0178D3B31CA1B6EC4A8ECC42F7529@AM5PR83MB0178.EURPRD83.prod.outlook.com > [2]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#timeouts.spec_failed_because_of_statement_cancelled_due_to_unexpected_reason > [3]: https://www.postgresql.org/message-id/4db099c8-4a52-3cc4-e970-14539a319466%40gmail.com#a81203796f7257531d037e2ca6a0ac43 > [4]: https://www.postgresql.org/message-id/CAGECzQSBp%3DsCcMCy1txig7cV0VX9hDDb1UdZt3zpDpZe2X2K-Q%40mail.gmail.com#a8d453bddc52f0db0b0fa80676d352c5 > [5]: https://www.postgresql.org/message-id/584811.1725048717%40sss.pgh.pa.us#030de114b9a6a3eafd082a92fb95fa6e > [6]: https://www.postgresql.org/message-id/CAGECzQR2HqQTFUddz-Q6y4sfMtTA%2BLRE7WqJtpPSrJKKUXtH7w%40mail.gmail.com#702f7a3fe3936b30bdaad0c5264f5a08 > [7]: https://www.postgresql.org/message-id/CAGECzQQM1N0H3e9RbSqi4%3DmmFgfLaFNr6R7gNpFoePb9exRryg%40mail.gmail.com#f11b82da794d5996eca82339c2a17caa
Re: Always show correct error message for statement timeouts, fixes random buildfarm failures
От
Akshat Jaimini
Дата:
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: not tested Spec compliant: not tested Documentation: not tested Hi, Just reviewed the patch and it seems harmless. I think we should commit this. Regards, Akshat Jaimini The new status of this patch is: Ready for Committer
Akshat Jaimini <destrex271@gmail.com> writes:
> Just reviewed the patch and it seems harmless. I think we should commit this.
I took another look and I still don't like it. The intent of the code
is that once we're out of a query, a query cancel that arrived
meanwhile should be a no-op. I completely disagree with the idea that
a late-arriving cancel should cancel the next query. (From the user's
standpoint: if you're a bit slow to hit control-C and so your query
finishes anyway, and then you go for coffee, would you be surprised
to come back awhile later and enter a new query and have it fail?)
And I think the same should apply to statement timeouts: if one
happens slightly too late to cancel the current query, that does not
mean it should cancel the next one.
That's not to say that I don't think there's anything to do about
this. I just don't think this is what to do.
Looking around at the code, I think perhaps some blame should fall
on this bit in PostgresMain:
/*
* (5) disable async signal conditions again.
*
* Query cancel is supposed to be a no-op when there is no query in
* progress, so if a query cancel arrived while we were idle, just
* reset QueryCancelPending. ProcessInterrupts() has that effect when
* it's called when DoingCommandRead is set, so check for interrupts
* before resetting DoingCommandRead.
*/
CHECK_FOR_INTERRUPTS();
DoingCommandRead = false;
We are taking care to discard a stale QueryCancelPending flag, but
this code is ignorant of the fact that timeouts now also feed into
QueryCancelPending. Perhaps we should clear the timeout indicators
here? Or maybe using CHECK_FOR_INTERRUPTS here is too cute and
we should just clear QueryCancelPending directly (and clear the
indicators too, likely).
Also, I could buy making the proposed change in
disable_statement_timeout without the one in enable_statement_timeout.
I think that would have the desired effect that if ProcessInterrupts
runs shortly after disable_statement_timeout it would not lie about
the cause of the cancel, while not sacrificing the principle that
a statement timeout can't kill the next statement.
As a more thorough redesign, maybe we should get rid of the
idea that QueryCancelPending represents all three of these
conditions. That would mean a bit more duplication of logic
in ProcessInterrupts, but I don't think that code is especially
performance-critical.
regards, tom lane
Re: Always show correct error message for statement timeouts, fixes random buildfarm failures
От
"Jelte Fennema-Nio"
Дата:
On Thu, 8 Jan 2026 at 18:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I took another look and I still don't like it. The intent of the code
> is that once we're out of a query, a query cancel that arrived
> meanwhile should be a no-op. I completely disagree with the idea that
> a late-arriving cancel should cancel the next query. (From the user's
> standpoint: if you're a bit slow to hit control-C and so your query
> finishes anyway, and then you go for coffee, would you be surprised
> to come back awhile later and enter a new query and have it fail?)
> And I think the same should apply to statement timeouts: if one
> happens slightly too late to cancel the current query, that does not
> mean it should cancel the next one.
The fact that you're thinking that my intent of the patch was this, that
means I did a terrible job at explaining the bug that this patch is
fixing. So I'll try doing a better job now.
The bug this is fixing has nothing to do with multiple queries. It
occurs within a single query being processed by exec_simple_query:
1. At line 1051 start_xact_command is called and then calls
enable_statement_timeout to start the timeout for the query.
2. The code continues to line 1148 where start_xact_command is called AGAIN.
3. Now the timeout occurs (could also happen a bit earlier, as long as
no CHECK_FOR_INTERRUPTS is called). This sets the statement timeout
indicator and calls remove_timeout_index to mark the timeout as
inactive.
4. Now the second start_xact_command calls enable_statement_timeout AGAIN.
5. Because get_timeout_active returns false now due to the timeout,
enable_timeout_after is called again which resets the indicator flag
6. New CHECK_FOR_INTERRUPTS is called wherever in the query and
QueryCancelPending is still set (correctly), but the matching
indicator flag was lost.
The problem is that while the comment above 4 shows that the intent of
calling enable_statement_timeout again without an interceding
finish_xact_command is a no-op:
/*
* Start statement timeout if necessary. Note that this'll intentionally
* not reset the clock on an already started timeout, to avoid the timing
* overhead when start_xact_command() is invoked repeatedly, without an
* interceding finish_xact_command() (e.g. parse/bind/execute). If that's
* not desired, the timeout has to be disabled explicitly.
*/
But that's not actually the case if you run into this race where the
timeout has already fired, because then It resets the just-before-set
indicator flag. Note that the comment also explains that this is not the
only place where this can happen.
> /*
> * (5) disable async signal conditions again.
> *
> * Query cancel is supposed to be a no-op when there is no query in
> * progress, so if a query cancel arrived while we were idle, just
> * reset QueryCancelPending. ProcessInterrupts() has that effect when
> * it's called when DoingCommandRead is set, so check for interrupts
> * before resetting DoingCommandRead.
> */
> CHECK_FOR_INTERRUPTS();
> DoingCommandRead = false;
>
> We are taking care to discard a stale QueryCancelPending flag, but
> this code is ignorant of the fact that timeouts now also feed into
> QueryCancelPending. Perhaps we should clear the timeout indicators
> here? Or maybe using CHECK_FOR_INTERRUPTS here is too cute and
> we should just clear QueryCancelPending directly (and clear the
> indicators too, likely).
Thank you for pointing this out. I think (but I'm not exactly sure what
I was thinking over a year ago), that I had trusted the comment there,
and I had missed that my changes had inadvertently changed the behaviour
here because now the indicators could still be set at this
CHECK_FOR_INTERRUPTS call.
Notably CHECK_FOR_INTERRUPTS is already clearing the indicators (your
2nd proposed option), because it's calling get_timeout_indicator with
the true flag. Though that's not enough, because it still throws an
error if they are set even if DoingCommandRead=true. So the newly
attached patch changes that behaviour instead by checking if
DoingCommandRead is false before throwing any error. Does that resolve
your concerns about the possibility of cancels leaking to the next
query? (which to be extremely clear, was unintended)
> Also, I could buy making the proposed change in
> disable_statement_timeout without the one in enable_statement_timeout.
> I think that would have the desired effect that if ProcessInterrupts
> runs shortly after disable_statement_timeout it would not lie about
> the cause of the cancel, while not sacrificing the principle that
> a statement timeout can't kill the next statement.
As explained above the one in enable_statement_timeout is exactly the
one that is required to fix the bug in question.
> As a more thorough redesign, maybe we should get rid of the
> idea that QueryCancelPending represents all three of these
> conditions. That would mean a bit more duplication of logic
> in ProcessInterrupts, but I don't think that code is especially
> performance-critical.
I would like that very much, as I think it would make the code much
easier to follow (I think it's pretty hard to understand code
currently). The problem I see with actually doing that though, is that
StatementTimeoutHandler is sending SIGINT to the whole process group. So
unless we want to stop doing that, I don't see a way of separating the
indicators from QueryCancelPending.
Вложения
Hi, Jelte On Fri, 09 Jan 2026 at 23:08, "Jelte Fennema-Nio" <postgres@jeltef.nl> wrote: > On Thu, 8 Jan 2026 at 18:50, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I took another look and I still don't like it. The intent of the code >> is that once we're out of a query, a query cancel that arrived >> meanwhile should be a no-op. I completely disagree with the idea that >> a late-arriving cancel should cancel the next query. (From the user's >> standpoint: if you're a bit slow to hit control-C and so your query >> finishes anyway, and then you go for coffee, would you be surprised >> to come back awhile later and enter a new query and have it fail?) >> And I think the same should apply to statement timeouts: if one >> happens slightly too late to cancel the current query, that does not >> mean it should cancel the next one. > > The fact that you're thinking that my intent of the patch was this, that > means I did a terrible job at explaining the bug that this patch is > fixing. So I'll try doing a better job now. > > The bug this is fixing has nothing to do with multiple queries. It > occurs within a single query being processed by exec_simple_query: > 1. At line 1051 start_xact_command is called and then calls > enable_statement_timeout to start the timeout for the query. > 2. The code continues to line 1148 where start_xact_command is called AGAIN. > 3. Now the timeout occurs (could also happen a bit earlier, as long as > no CHECK_FOR_INTERRUPTS is called). This sets the statement timeout > indicator and calls remove_timeout_index to mark the timeout as > inactive. > 4. Now the second start_xact_command calls enable_statement_timeout AGAIN. > 5. Because get_timeout_active returns false now due to the timeout, > enable_timeout_after is called again which resets the indicator flag > 6. New CHECK_FOR_INTERRUPTS is called wherever in the query and > QueryCancelPending is still set (correctly), but the matching > indicator flag was lost. > > The problem is that while the comment above 4 shows that the intent of > calling enable_statement_timeout again without an interceding > finish_xact_command is a no-op: > > /* > * Start statement timeout if necessary. Note that this'll intentionally > * not reset the clock on an already started timeout, to avoid the timing > * overhead when start_xact_command() is invoked repeatedly, without an > * interceding finish_xact_command() (e.g. parse/bind/execute). If that's > * not desired, the timeout has to be disabled explicitly. > */ > > But that's not actually the case if you run into this race where the > timeout has already fired, because then It resets the just-before-set > indicator flag. Note that the comment also explains that this is not the > only place where this can happen. > >> /* >> * (5) disable async signal conditions again. >> * >> * Query cancel is supposed to be a no-op when there is no query in >> * progress, so if a query cancel arrived while we were idle, just >> * reset QueryCancelPending. ProcessInterrupts() has that effect when >> * it's called when DoingCommandRead is set, so check for interrupts >> * before resetting DoingCommandRead. >> */ >> CHECK_FOR_INTERRUPTS(); >> DoingCommandRead = false; >> >> We are taking care to discard a stale QueryCancelPending flag, but >> this code is ignorant of the fact that timeouts now also feed into >> QueryCancelPending. Perhaps we should clear the timeout indicators >> here? Or maybe using CHECK_FOR_INTERRUPTS here is too cute and >> we should just clear QueryCancelPending directly (and clear the >> indicators too, likely). > > Thank you for pointing this out. I think (but I'm not exactly sure what > I was thinking over a year ago), that I had trusted the comment there, > and I had missed that my changes had inadvertently changed the behaviour > here because now the indicators could still be set at this > CHECK_FOR_INTERRUPTS call. > > Notably CHECK_FOR_INTERRUPTS is already clearing the indicators (your > 2nd proposed option), because it's calling get_timeout_indicator with > the true flag. Though that's not enough, because it still throws an > error if they are set even if DoingCommandRead=true. So the newly > attached patch changes that behaviour instead by checking if > DoingCommandRead is false before throwing any error. Does that resolve > your concerns about the possibility of cancels leaking to the next > query? (which to be extremely clear, was unintended) > >> Also, I could buy making the proposed change in >> disable_statement_timeout without the one in enable_statement_timeout. >> I think that would have the desired effect that if ProcessInterrupts >> runs shortly after disable_statement_timeout it would not lie about >> the cause of the cancel, while not sacrificing the principle that >> a statement timeout can't kill the next statement. > > As explained above the one in enable_statement_timeout is exactly the > one that is required to fix the bug in question. > >> As a more thorough redesign, maybe we should get rid of the >> idea that QueryCancelPending represents all three of these >> conditions. That would mean a bit more duplication of logic >> in ProcessInterrupts, but I don't think that code is especially >> performance-critical. > > I would like that very much, as I think it would make the code much > easier to follow (I think it's pretty hard to understand code > currently). The problem I see with actually doing that though, is that > StatementTimeoutHandler is sending SIGINT to the whole process group. So > unless we want to stop doing that, I don't see a way of separating the > indicators from QueryCancelPending. I recently encountered this error [0]. I'm wondering if we could add a test case to cover this situation. [0] https://www.postgresql.org/message-id/MEAPR01MB30313EF4A7C22EA3495A861BB682A@MEAPR01MB3031.ausprd01.prod.outlook.com -- Regards, Japin Li ChengDu WenWu Information Technology Co., Ltd.