Re: Always show correct error message for statement timeouts, fixes random buildfarm failures
| От | Jelte Fennema-Nio |
|---|---|
| Тема | Re: Always show correct error message for statement timeouts, fixes random buildfarm failures |
| Дата | |
| Msg-id | DFKEG9KRC5B2.3NUOM3UZY7QTV@jeltef.nl обсуждение исходный текст |
| Ответ на | Re: Always show correct error message for statement timeouts, fixes random buildfarm failures (Tom Lane <tgl@sss.pgh.pa.us>) |
| Ответы |
Re: Always show correct error message for statement timeouts, fixes random buildfarm failures
|
| Список | pgsql-hackers |
On Thu, 8 Jan 2026 at 18:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I took another look and I still don't like it. The intent of the code
> is that once we're out of a query, a query cancel that arrived
> meanwhile should be a no-op. I completely disagree with the idea that
> a late-arriving cancel should cancel the next query. (From the user's
> standpoint: if you're a bit slow to hit control-C and so your query
> finishes anyway, and then you go for coffee, would you be surprised
> to come back awhile later and enter a new query and have it fail?)
> And I think the same should apply to statement timeouts: if one
> happens slightly too late to cancel the current query, that does not
> mean it should cancel the next one.
The fact that you're thinking that my intent of the patch was this, that
means I did a terrible job at explaining the bug that this patch is
fixing. So I'll try doing a better job now.
The bug this is fixing has nothing to do with multiple queries. It
occurs within a single query being processed by exec_simple_query:
1. At line 1051 start_xact_command is called and then calls
enable_statement_timeout to start the timeout for the query.
2. The code continues to line 1148 where start_xact_command is called AGAIN.
3. Now the timeout occurs (could also happen a bit earlier, as long as
no CHECK_FOR_INTERRUPTS is called). This sets the statement timeout
indicator and calls remove_timeout_index to mark the timeout as
inactive.
4. Now the second start_xact_command calls enable_statement_timeout AGAIN.
5. Because get_timeout_active returns false now due to the timeout,
enable_timeout_after is called again which resets the indicator flag
6. New CHECK_FOR_INTERRUPTS is called wherever in the query and
QueryCancelPending is still set (correctly), but the matching
indicator flag was lost.
The problem is that while the comment above 4 shows that the intent of
calling enable_statement_timeout again without an interceding
finish_xact_command is a no-op:
/*
* Start statement timeout if necessary. Note that this'll intentionally
* not reset the clock on an already started timeout, to avoid the timing
* overhead when start_xact_command() is invoked repeatedly, without an
* interceding finish_xact_command() (e.g. parse/bind/execute). If that's
* not desired, the timeout has to be disabled explicitly.
*/
But that's not actually the case if you run into this race where the
timeout has already fired, because then It resets the just-before-set
indicator flag. Note that the comment also explains that this is not the
only place where this can happen.
> /*
> * (5) disable async signal conditions again.
> *
> * Query cancel is supposed to be a no-op when there is no query in
> * progress, so if a query cancel arrived while we were idle, just
> * reset QueryCancelPending. ProcessInterrupts() has that effect when
> * it's called when DoingCommandRead is set, so check for interrupts
> * before resetting DoingCommandRead.
> */
> CHECK_FOR_INTERRUPTS();
> DoingCommandRead = false;
>
> We are taking care to discard a stale QueryCancelPending flag, but
> this code is ignorant of the fact that timeouts now also feed into
> QueryCancelPending. Perhaps we should clear the timeout indicators
> here? Or maybe using CHECK_FOR_INTERRUPTS here is too cute and
> we should just clear QueryCancelPending directly (and clear the
> indicators too, likely).
Thank you for pointing this out. I think (but I'm not exactly sure what
I was thinking over a year ago), that I had trusted the comment there,
and I had missed that my changes had inadvertently changed the behaviour
here because now the indicators could still be set at this
CHECK_FOR_INTERRUPTS call.
Notably CHECK_FOR_INTERRUPTS is already clearing the indicators (your
2nd proposed option), because it's calling get_timeout_indicator with
the true flag. Though that's not enough, because it still throws an
error if they are set even if DoingCommandRead=true. So the newly
attached patch changes that behaviour instead by checking if
DoingCommandRead is false before throwing any error. Does that resolve
your concerns about the possibility of cancels leaking to the next
query? (which to be extremely clear, was unintended)
> Also, I could buy making the proposed change in
> disable_statement_timeout without the one in enable_statement_timeout.
> I think that would have the desired effect that if ProcessInterrupts
> runs shortly after disable_statement_timeout it would not lie about
> the cause of the cancel, while not sacrificing the principle that
> a statement timeout can't kill the next statement.
As explained above the one in enable_statement_timeout is exactly the
one that is required to fix the bug in question.
> As a more thorough redesign, maybe we should get rid of the
> idea that QueryCancelPending represents all three of these
> conditions. That would mean a bit more duplication of logic
> in ProcessInterrupts, but I don't think that code is especially
> performance-critical.
I would like that very much, as I think it would make the code much
easier to follow (I think it's pretty hard to understand code
currently). The problem I see with actually doing that though, is that
StatementTimeoutHandler is sending SIGINT to the whole process group. So
unless we want to stop doing that, I don't see a way of separating the
indicators from QueryCancelPending.
Вложения
В списке pgsql-hackers по дате отправления: