Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

Поиск
Список
Период
Сортировка
От Alexander Lakhin
Тема Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Дата
Msg-id 1fe6b898-1722-d190-03c0-bf4c4884a748@gmail.com
обсуждение исходный текст
Ответ на Re: Why is src/test/modules/committs/t/002_standby.pl flaky?  (Thomas Munro <thomas.munro@gmail.com>)
Ответы Re: Why is src/test/modules/committs/t/002_standby.pl flaky?  (Thomas Munro <thomas.munro@gmail.com>)
Список pgsql-hackers
10.01.2022 23:52, Thomas Munro wrote:
>> I'm yet to find out whether the other 
>> WaitLatchOrSocket' users (e. g. postgres_fdw) can suffer from the
>> disconnected socket state, but this approach definitely works for
>> walreceiver.
> I see where you're going: there might be safe call sequences and
> unsafe call sequences, and maybe walreceiver is asking for trouble by
> double-polling.  I'm not sure about that; I got the impression
> recently that it's possible to get FD_CLOSE while you still have
> buffered data to read, so then the next recv() will return > 0 and
> then we don't have any state left anywhere to remember that we saw
> FD_CLOSE, even if you're careful to poll and read in the ideal
> sequence.  I could be wrong, and it would be nice if there is an easy
> fix along those lines...  The documentation around FD_CLOSE is
> unclear.
I had no strong opinion regarding unsafe sequence, though initially I
suspected that exactly the second libpqrcv_PQgetResult call could cause
the issue. But after digging into WaitLatchOrSocket I'd inclined to put
the fix deeper to satisfy all possible callers.
At the other hand, I've shared Tom's concerns regarding other clients,
that can stuck on WaitForMultipleObjects() just as walreceiver does, and
hoped that only walreceiver suffer from a graceful server socket closing.
So to get these doubts cleared, I've made a simple test for postgres_fdw
(please look at the attachment; you can put it into
contrib/postgres_fdw/t and run `vcregress taptest contrib\postgres_fdw`).
This test shows for me:
===
...
t/001_disconnection.pl .. # 12:13:39.481084 executing query...
# 12:13:43.245277 result:       0
# 0|0

# 12:13:43.246342 executing query...
# 12:13:46.525924 result:       0
# 0|0

# 12:13:46.527097 executing query...
# 12:13:47.745176 result:       3
#
# psql:<stdin>:1: WARNING:  no connection to the server
# psql:<stdin>:1: ERROR:  FATAL:  terminating connection due to
administrator co
mmand
# server closed the connection unexpectedly
#       This probably means the server terminated abnormally
#       before or while processing the request.
# CONTEXT:  remote SQL command: FETCH 100 FROM c1
# 12:13:47.794612 executing query...
# 12:13:51.073318 result:       0
# 0|0

# 12:13:51.074347 executing query...
===

With the simple logging added to connection.c:
                /* Sleep until there's something to do */
elog(LOG, "pgfdw_get_result before WaitLatchOrSocket");
                wc = WaitLatchOrSocket(MyLatch,
                                       WL_LATCH_SET | WL_SOCKET_READABLE |
                                       WL_EXIT_ON_PM_DEATH,
                                       PQsocket(conn),
                                       -1L, PG_WAIT_EXTENSION);
elog(LOG, "pgfdw_get_result after WaitLatchOrSocket");

I see in 001_disconnection_local.log:
...
2022-01-11 15:13:52.875 MSK|Administrator|postgres|61dd747f.5e4|LOG: 
pgfdw_get_result after WaitLatchOrSocket
2022-01-11 15:13:52.875
MSK|Administrator|postgres|61dd747f.5e4|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
2022-01-11 15:13:52.875 MSK|Administrator|postgres|61dd747f.5e4|LOG: 
pgfdw_get_result before WaitLatchOrSocket
2022-01-11 15:13:52.875
MSK|Administrator|postgres|61dd747f.5e4|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
2022-01-11 15:14:36.976 MSK|||61dd74ac.840|DEBUG:  autovacuum:
processing database "postgres"
2022-01-11 15:14:51.088 MSK|Administrator|postgres|61dd747f.5e4|LOG: 
pgfdw_get_result after WaitLatchOrSocket
2022-01-11 15:14:51.088
MSK|Administrator|postgres|61dd747f.5e4|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
2022-01-11 15:14:51.089 MSK|Administrator|postgres|61dd747f.5e4|LOG: 
pgfdw_get_result before WaitLatchOrSocket
2022-01-11 15:14:51.089
MSK|Administrator|postgres|61dd747f.5e4|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
2022-01-11 15:15:37.006 MSK|||61dd74e9.9e8|DEBUG:  autovacuum:
processing database "postgres"
2022-01-11 15:16:37.116 MSK|||61dd7525.ad0|DEBUG:  autovacuum:
processing database "postgres"
2022-01-11 15:17:37.225 MSK|||61dd7561.6a0|DEBUG:  autovacuum:
processing database "postgres"
2022-01-11 15:18:36.916 MSK|||61dd7470.704|LOG:  checkpoint starting: time
...
2022-01-11 15:36:38.225 MSK|||61dd79d6.2a0|DEBUG:  autovacuum:
processing database "postgres"
...

So here we get similar hanging on WaitLatchOrSocket().
Just to make sure that it's indeed the same issue, I've removed socket
shutdown&close and the test executed to the end (several times). Argh.

Best regards,
Alexander

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Allie Crawford
Дата:
Сообщение: Re: [Ext:] Re: Stream Replication not working
Следующее
От: Allie Crawford
Дата:
Сообщение: Re: [Ext:] Re: Stream Replication not working