Re: conchuela timeouts since 2021-10-09 system upgrade

From Tom Lane
Subject Re: conchuela timeouts since 2021-10-09 system upgrade
Date
Msg-id 83446.1635258579@sss.pgh.pa.us
In response to Re: conchuela timeouts since 2021-10-09 system upgrade  (Noah Misch <noah@leadboat.com>)
Responses Re: conchuela timeouts since 2021-10-09 system upgrade  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-bugs
Noah Misch <noah@leadboat.com> writes:
> On Tue, Oct 26, 2021 at 02:03:54AM -0400, Tom Lane wrote:
>> Or more
>> practically, use advisory locks in that script to enforce that only one
>> runs at once.

> The author did try that.

Hmm ... that ought to have done the trick, I'd think.  However:

> Both sound doable, but I don't expect either to fix prairiedog's trouble.

Yeah :-(.  I think this test is somehow stumbling over a pre-existing bug.

>> So what we have is that libpq thinks it's sent the next DROP INDEX,
>> but the backend hasn't seen it.

> Thanks for isolating that.

The plot thickens.  When I went back to look at that machine this morning,
I found this in the postmaster log:

2021-10-26 02:52:09.324 EDT [1013] 002_cic.pl LOG:  statement: DROP INDEX CONCURRENTLY idx;
2021-10-26 02:52:09.352 EDT [1013] 002_cic.pl LOG:  could not send data to client: Broken pipe
2021-10-26 02:52:09.352 EDT [1013] 002_cic.pl FATAL:  connection to client lost

The timestamps correspond (more or less anyway) to when I killed off the
stuck test run and went to bed.  So the DROP command *was* sent, and it
was eventually received by the backend, but it seems to have taken killing
the pgbench process to do it.

I think this probably exonerates the pgbench/libpq side of things, and
instead we have to wonder about a backend or kernel bug.  A kernel bug
could possibly explain the unexplainable connection to what's happening on
some other file descriptor.  I'd be prepared to believe that prairiedog's
ancient macOS version has some weird bug preventing kevent() from noticing
available data ... but (a) surely conchuela wouldn't share such a bug,
and (b) we've been using kevent() for a couple years now, so how come
we didn't see this before?

Still baffled.  I'm currently experimenting to see if the bug reproduces
when latch.c is made to use poll() instead of kevent().  But the failure
rate was low enough that it'll be hours before I can say confidently
that it doesn't (unless, of course, it does).

            regards, tom lane


