Re: Always show correct error message for statement timeouts, fixes random buildfarm failures

Поиск
Список
Период
Сортировка
От Japin Li
Тема Re: Always show correct error message for statement timeouts, fixes random buildfarm failures
Дата
Msg-id SYAPR01MB30382BDFBE6D6B0D514F8F78B683A@SYAPR01MB3038.ausprd01.prod.outlook.com
обсуждение исходный текст
Ответ на Re: Always show correct error message for statement timeouts, fixes random buildfarm failures  ("Jelte Fennema-Nio" <postgres@jeltef.nl>)
Список pgsql-hackers
Hi, Jelte

On Fri, 09 Jan 2026 at 23:08, "Jelte Fennema-Nio" <postgres@jeltef.nl> wrote:
> On Thu, 8 Jan 2026 at 18:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I took another look and I still don't like it.  The intent of the code
>> is that once we're out of a query, a query cancel that arrived
>> meanwhile should be a no-op.  I completely disagree with the idea that
>> a late-arriving cancel should cancel the next query.  (From the user's
>> standpoint: if you're a bit slow to hit control-C and so your query
>> finishes anyway, and then you go for coffee, would you be surprised
>> to come back awhile later and enter a new query and have it fail?)
>> And I think the same should apply to statement timeouts: if one
>> happens slightly too late to cancel the current query, that does not
>> mean it should cancel the next one.
>
> The fact that you're thinking that my intent of the patch was this, that
> means I did a terrible job at explaining the bug that this patch is
> fixing. So I'll try doing a better job now.
>
> The bug this is fixing has nothing to do with multiple queries. It
> occurs within a single query being processed by exec_simple_query:
> 1. At line 1051 start_xact_command is called and then calls
>    enable_statement_timeout to start the timeout for the query.
> 2. The code continues to line 1148 where start_xact_command is called AGAIN.
> 3. Now the timeout occurs (could also happen a bit earlier, as long as
>    no CHECK_FOR_INTERRUPTS is called). This sets the statement timeout
>    indicator and calls remove_timeout_index to mark the timeout as
>    inactive.
> 4. Now the second start_xact_command calls enable_statement_timeout AGAIN.
> 5. Because get_timeout_active returns false now due to the timeout,
>    enable_timeout_after is called again which resets the indicator flag
> 6. New CHECK_FOR_INTERRUPTS is called wherever in the query and
>    QueryCancelPending is still set (correctly), but the matching
>    indicator flag was lost.
>
> The problem is that while the comment above 4 shows that the intent of
> calling enable_statement_timeout again without an interceding
> finish_xact_command is a no-op:
>
>     /*
>      * Start statement timeout if necessary.  Note that this'll intentionally
>      * not reset the clock on an already started timeout, to avoid the timing
>      * overhead when start_xact_command() is invoked repeatedly, without an
>      * interceding finish_xact_command() (e.g. parse/bind/execute).  If that's
>      * not desired, the timeout has to be disabled explicitly.
>      */
>
> But that's not actually the case if you run into this race where the
> timeout has already fired, because then It resets the just-before-set
> indicator flag. Note that the comment also explains that this is not the
> only place where this can happen.
>
>>         /*
>>          * (5) disable async signal conditions again.
>>          *
>>          * Query cancel is supposed to be a no-op when there is no query in
>>          * progress, so if a query cancel arrived while we were idle, just
>>          * reset QueryCancelPending. ProcessInterrupts() has that effect when
>>          * it's called when DoingCommandRead is set, so check for interrupts
>>          * before resetting DoingCommandRead.
>>          */
>>         CHECK_FOR_INTERRUPTS();
>>         DoingCommandRead = false;
>>
>> We are taking care to discard a stale QueryCancelPending flag, but
>> this code is ignorant of the fact that timeouts now also feed into
>> QueryCancelPending.  Perhaps we should clear the timeout indicators
>> here?  Or maybe using CHECK_FOR_INTERRUPTS here is too cute and
>> we should just clear QueryCancelPending directly (and clear the
>> indicators too, likely).
>
> Thank you for pointing this out. I think (but I'm not exactly sure what
> I was thinking over a year ago), that I had trusted the comment there,
> and I had missed that my changes had inadvertently changed the behaviour
> here because now the indicators could still be set at this
> CHECK_FOR_INTERRUPTS call.
>
> Notably CHECK_FOR_INTERRUPTS is already clearing the indicators (your
> 2nd proposed option), because it's calling get_timeout_indicator with
> the true flag. Though that's not enough, because it still throws an
> error if they are set even if DoingCommandRead=true. So the newly
> attached patch changes that behaviour instead by checking if
> DoingCommandRead is false before throwing any error. Does that resolve
> your concerns about the possibility of cancels leaking to the next
> query? (which to be extremely clear, was unintended)
>
>> Also, I could buy making the proposed change in
>> disable_statement_timeout without the one in enable_statement_timeout.
>> I think that would have the desired effect that if ProcessInterrupts
>> runs shortly after disable_statement_timeout it would not lie about
>> the cause of the cancel, while not sacrificing the principle that
>> a statement timeout can't kill the next statement.
>
> As explained above the one in enable_statement_timeout is exactly the
> one that is required to fix the bug in question.
>
>> As a more thorough redesign, maybe we should get rid of the
>> idea that QueryCancelPending represents all three of these
>> conditions.  That would mean a bit more duplication of logic
>> in ProcessInterrupts, but I don't think that code is especially
>> performance-critical.
>
> I would like that very much, as I think it would make the code much
> easier to follow (I think it's pretty hard to understand code
> currently). The problem I see with actually doing that though, is that
> StatementTimeoutHandler is sending SIGINT to the whole process group. So
> unless we want to stop doing that, I don't see a way of separating the
> indicators from QueryCancelPending.

I recently encountered this error [0]. I'm wondering if we could add a test
case to cover this situation.

[0]
https://www.postgresql.org/message-id/MEAPR01MB30313EF4A7C22EA3495A861BB682A@MEAPR01MB3031.ausprd01.prod.outlook.com

-- 
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.



В списке pgsql-hackers по дате отправления: