Re: Always show correct error message for statement timeouts, fixes random buildfarm failures
| От | Japin Li |
|---|---|
| Тема | Re: Always show correct error message for statement timeouts, fixes random buildfarm failures |
| Дата | |
| Msg-id | SYAPR01MB30382BDFBE6D6B0D514F8F78B683A@SYAPR01MB3038.ausprd01.prod.outlook.com обсуждение исходный текст |
| Ответ на | Re: Always show correct error message for statement timeouts, fixes random buildfarm failures ("Jelte Fennema-Nio" <postgres@jeltef.nl>) |
| Список | pgsql-hackers |
Hi, Jelte On Fri, 09 Jan 2026 at 23:08, "Jelte Fennema-Nio" <postgres@jeltef.nl> wrote: > On Thu, 8 Jan 2026 at 18:50, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I took another look and I still don't like it. The intent of the code >> is that once we're out of a query, a query cancel that arrived >> meanwhile should be a no-op. I completely disagree with the idea that >> a late-arriving cancel should cancel the next query. (From the user's >> standpoint: if you're a bit slow to hit control-C and so your query >> finishes anyway, and then you go for coffee, would you be surprised >> to come back awhile later and enter a new query and have it fail?) >> And I think the same should apply to statement timeouts: if one >> happens slightly too late to cancel the current query, that does not >> mean it should cancel the next one. > > The fact that you're thinking that my intent of the patch was this, that > means I did a terrible job at explaining the bug that this patch is > fixing. So I'll try doing a better job now. > > The bug this is fixing has nothing to do with multiple queries. It > occurs within a single query being processed by exec_simple_query: > 1. At line 1051 start_xact_command is called and then calls > enable_statement_timeout to start the timeout for the query. > 2. The code continues to line 1148 where start_xact_command is called AGAIN. > 3. Now the timeout occurs (could also happen a bit earlier, as long as > no CHECK_FOR_INTERRUPTS is called). This sets the statement timeout > indicator and calls remove_timeout_index to mark the timeout as > inactive. > 4. Now the second start_xact_command calls enable_statement_timeout AGAIN. > 5. Because get_timeout_active returns false now due to the timeout, > enable_timeout_after is called again which resets the indicator flag > 6. New CHECK_FOR_INTERRUPTS is called wherever in the query and > QueryCancelPending is still set (correctly), but the matching > indicator flag was lost. > > The problem is that while the comment above 4 shows that the intent of > calling enable_statement_timeout again without an interceding > finish_xact_command is a no-op: > > /* > * Start statement timeout if necessary. Note that this'll intentionally > * not reset the clock on an already started timeout, to avoid the timing > * overhead when start_xact_command() is invoked repeatedly, without an > * interceding finish_xact_command() (e.g. parse/bind/execute). If that's > * not desired, the timeout has to be disabled explicitly. > */ > > But that's not actually the case if you run into this race where the > timeout has already fired, because then It resets the just-before-set > indicator flag. Note that the comment also explains that this is not the > only place where this can happen. > >> /* >> * (5) disable async signal conditions again. >> * >> * Query cancel is supposed to be a no-op when there is no query in >> * progress, so if a query cancel arrived while we were idle, just >> * reset QueryCancelPending. ProcessInterrupts() has that effect when >> * it's called when DoingCommandRead is set, so check for interrupts >> * before resetting DoingCommandRead. >> */ >> CHECK_FOR_INTERRUPTS(); >> DoingCommandRead = false; >> >> We are taking care to discard a stale QueryCancelPending flag, but >> this code is ignorant of the fact that timeouts now also feed into >> QueryCancelPending. Perhaps we should clear the timeout indicators >> here? Or maybe using CHECK_FOR_INTERRUPTS here is too cute and >> we should just clear QueryCancelPending directly (and clear the >> indicators too, likely). > > Thank you for pointing this out. I think (but I'm not exactly sure what > I was thinking over a year ago), that I had trusted the comment there, > and I had missed that my changes had inadvertently changed the behaviour > here because now the indicators could still be set at this > CHECK_FOR_INTERRUPTS call. > > Notably CHECK_FOR_INTERRUPTS is already clearing the indicators (your > 2nd proposed option), because it's calling get_timeout_indicator with > the true flag. Though that's not enough, because it still throws an > error if they are set even if DoingCommandRead=true. So the newly > attached patch changes that behaviour instead by checking if > DoingCommandRead is false before throwing any error. Does that resolve > your concerns about the possibility of cancels leaking to the next > query? (which to be extremely clear, was unintended) > >> Also, I could buy making the proposed change in >> disable_statement_timeout without the one in enable_statement_timeout. >> I think that would have the desired effect that if ProcessInterrupts >> runs shortly after disable_statement_timeout it would not lie about >> the cause of the cancel, while not sacrificing the principle that >> a statement timeout can't kill the next statement. > > As explained above the one in enable_statement_timeout is exactly the > one that is required to fix the bug in question. > >> As a more thorough redesign, maybe we should get rid of the >> idea that QueryCancelPending represents all three of these >> conditions. That would mean a bit more duplication of logic >> in ProcessInterrupts, but I don't think that code is especially >> performance-critical. > > I would like that very much, as I think it would make the code much > easier to follow (I think it's pretty hard to understand code > currently). The problem I see with actually doing that though, is that > StatementTimeoutHandler is sending SIGINT to the whole process group. So > unless we want to stop doing that, I don't see a way of separating the > indicators from QueryCancelPending. I recently encountered this error [0]. I'm wondering if we could add a test case to cover this situation. [0] https://www.postgresql.org/message-id/MEAPR01MB30313EF4A7C22EA3495A861BB682A@MEAPR01MB3031.ausprd01.prod.outlook.com -- Regards, Japin Li ChengDu WenWu Information Technology Co., Ltd.
В списке pgsql-hackers по дате отправления: