Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

From Thomas Munro
Subject Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Date
Msg-id CA+hUKGKFajQiUMerS4h_=eCNKoXMYoYSkyvwMZip7WV3KUckrg@mail.gmail.com
In response to Re: Why is src/test/modules/committs/t/002_standby.pl flaky?  (Alexander Lakhin <exclusion@gmail.com>)
Responses Re: Why is src/test/modules/committs/t/002_standby.pl flaky?  (Alexander Lakhin <exclusion@gmail.com>)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?  (Thomas Munro <thomas.munro@gmail.com>)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?  (Alexander Lakhin <exclusion@gmail.com>)
List pgsql-hackers
On Wed, Jan 12, 2022 at 8:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> By the look of things, you are right and this is the localhost-only issue.

But can't that be explained with timing races?  You change some stuff
around and it becomes less likely that you get a FIN to arrive in a
super narrow window, which I'm guessing looks something like: recv ->
EWOULDBLOCK, [receive FIN], wait -> FD_CLOSE, wait [hangs].  Note that
it's not happening on several Windows BF animals, and the ones it is
happening on do it only every few weeks.

Here's a draft attempt at a fix.  First I tried to use recv(fd, &c, 1,
MSG_PEEK) == 0 to detect EOF, which seemed to me to be a reasonable
enough candidate, but somehow it corrupts the stream (!?), so I used
Alexander's POLLHUP idea, except I pushed it down to a more principled
place IMHO.  Then I suppressed it after the initial check because then
the logic from my earlier patch takes over, so stuff like FeBeWaitSet
doesn't suffer from extra calls, only these two paths that haven't
been converted to long-lived WESes yet.  Does this pass the test?

I wonder if this POLLHUP technique is reliable enough (I know that
wouldn't work on other systems[1], which is why I was trying to make
MSG_PEEK work...).
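
To make the two probes discussed above concrete, here is a minimal POSIX
sketch. It is not the attached patch and it is not Winsock code. The
function names peek_is_eof() and poll_sees_hangup() are hypothetical, and
MSG_DONTWAIT is an assumption added so that the peek never blocks
(available on Linux and the BSDs). It does not attempt to reproduce the
stream corruption seen with MSG_PEEK on Windows.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Probe 1: one-byte, non-destructive, non-blocking peek.
 * Returns 1 if the peer has closed (recv reports EOF), 0 if the
 * connection still looks open (data is pending, or nothing has
 * arrived yet), and -1 on any other error.
 */
static int
peek_is_eof(int fd)
{
    char    c;
    ssize_t n = recv(fd, &c, 1, MSG_PEEK | MSG_DONTWAIT);

    if (n == 0)
        return 1;               /* orderly shutdown by the peer */
    if (n > 0 || errno == EAGAIN || errno == EWOULDBLOCK)
        return 0;               /* data pending, or nothing to read yet */
    return -1;
}

/*
 * Probe 2: zero-timeout poll, checking for POLLHUP.
 * Returns 1 if POLLHUP is reported, 0 if not, -1 if poll() fails.
 * Whether a half-closed peer is reported as POLLHUP varies by
 * platform and socket type, so a 0 here does not prove the peer
 * is still connected.
 */
static int
poll_sees_hangup(int fd)
{
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    if (poll(&pfd, 1, 0) < 0)
        return -1;
    return (pfd.revents & POLLHUP) != 0;
}
```

Either probe would run once before entering a wait, so that an EOF which
arrived before the wait began is not missed. The peek relies only on
recv() semantics, while the POLLHUP check depends on how the platform
reports a closed peer.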

What about environment variable PG_TEST_USE_UNIX_SOCKETS=1, does it
reproduce with that set, and does the patch fix it?  I'm hoping that
explains some Windows CI failures from a nearby thread[2].

[1] https://illumos.topicbox.com/groups/developer/T5576767e764aa26a-Maf8f3460c2866513b0ac51bf
[2] https://www.postgresql.org/message-id/flat/CALT9ZEG%3DC%3DJSypzt2gz6NoNtx-ew2tYHbwiOfY_xNo%2ByBY_%3Djw%40mail.gmail.com
