Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
От | Alexander Lakhin |
---|---|
Тема | Re: Why is src/test/modules/committs/t/002_standby.pl flaky? |
Дата | |
Msg-id | 2174557f-ead8-430d-a07e-d92b50e69b65@gmail.com обсуждение исходный текст |
Ответ на | Re: Why is src/test/modules/committs/t/002_standby.pl flaky? (Thomas Munro <thomas.munro@gmail.com>) |
Список | pgsql-hackers |
13.01.2022 09:36, Thomas Munro wrote: > Here's a draft attempt at a fix. First I tried to use recv(fd, &c, 1, > MSG_PEEK) == 0 to detect EOF, which seemed to me to be a reasonable > enough candidate, but somehow it corrupts the stream (!?), so I used > Alexander's POLLHUP idea, except I pushed it down to a more principled > place IMHO. Then I suppressed it after the initial check because then > the logic from my earlier patch takes over, so stuff like FeBeWaitSet > doesn't suffer from extra calls, only these two paths that haven't > been converted to long-lived WESes yet. Does this pass the test? Yes, this fix eliminates the flakiness for me. The test commit_ts (with 002_standby and 003_standby_2) passed 2x200 iterations. It also makes my test postgres_fdw/001_disconnection pass reliably. > I wonder if this POLLHUP technique is reliable enough (I know that > wouldn't work on other systems[1], which is why I was trying to make > MSG_PEEK work...). > > What about environment variable PG_TEST_USE_UNIX_SOCKETS=1, does it > reproduce with that set, and does the patch fix it? I'm hoping that > explains some Windows CI failures from a nearby thread[2]. With PG_TEST_USE_UNIX_SOCKETS=1 the test commit_ts/002_standby fails on the unpatched HEAD: t/002_standby.pl .... 1/4 # poll_query_until timed out executing this query: # SELECT '0/303C628'::pg_lsn <= pg_last_wal_replay_lsn() # expecting this output: # t # last actual query output: # f # with stderr: # Looks like your test exited with 25 just after 1. t/002_standby.pl .... Dubious, test returned 25 (wstat 6400, 0x1900) 002_standby_primary.log contains: 2022-01-13 18:57:32.925 PST [1448] LOG: starting PostgreSQL 15devel, compiled by Visual C++ build 1928, 64-bit 2022-01-13 18:57:32.926 PST [1448] LOG: listening on Unix socket "C:/Users/1/AppData/Local/Temp/yOKQPH1FoO/.s.PGSQL.62733" The same with my postgres_fdw test: # 03:41:12.533246 result: 0 # 0|0 # 03:41:12.534758 executing query... # 03:41:14.267594 result: 3 # # psql:<stdin>:1: WARNING: no connection to the server # psql:<stdin>:1: ERROR: FATAL: terminating connection due to administrator command # server closed the connection unexpectedly # This probably means the server terminated abnormally # before or while processing the request. # CONTEXT: remote SQL command: FETCH 100 FROM c1 # 03:41:14.270449 executing query... # 03:41:14.334437 result: 3 # # psql:<stdin>:1: ERROR: could not connect to server "fpg" # DETAIL: connection to server on socket "C:/Users/1/AppData/Local/Temp/hJWD9mzPHM/.s.PGSQL.57414" failed: Connection refused (0x0000274D/10061) # Is the server running locally and accepting connections on that socket? # 03:41:14.336918 executing query... # 03:41:14.422381 result: 3 # # psql:<stdin>:1: ERROR: could not connect to server "fpg" # DETAIL: connection to server on socket "C:/Users/1/AppData/Local/Temp/hJWD9mzPHM/.s.PGSQL.57414" failed: Connection refused (0x0000274D/10061) # Is the server running locally and accepting connections on that socket? # 03:41:14.425628 executing query... ...hang... With the patch these tests pass successfully. I can also confirm that on Windows 10 20H2 (previous tests were performed on Windows Server 2012) the unpatched HEAD + PG_TEST_USE_UNIX_SOCKETS=1 hangs on src\test\recovery\001_stream_rep (on iterations 1, 1, 4 for me). (v9-0001-Add-option-for-amcheck-and-pg_amcheck-to-check-un.patch [1] not required to see that.) 001_stream_rep_primary.log contains: ... 2022-01-13 19:46:48.287 PST [1364] LOG: listening on Unix socket "C:/Users/1/AppData/Local/Temp/EWzapwiXjV/.s.PGSQL.58248" 2022-01-13 19:46:48.317 PST [6736] LOG: database system was shut down at 2022-01-13 19:46:37 PST 2022-01-13 19:46:48.331 PST [1364] LOG: database system is ready to accept connections 2022-01-13 19:46:49.536 PST [1088] 001_stream_rep.pl LOG: statement: CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a 2022-01-13 19:46:49.646 PST [3028] 001_stream_rep.pl LOG: statement: SELECT pg_current_wal_insert_lsn() 2022-01-13 19:46:49.745 PST [3360] 001_stream_rep.pl LOG: statement: SELECT '0/3023268' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'standby_1'; ... 2022-01-13 19:50:39.755 PST [4924] 001_stream_rep.pl LOG: statement: SELECT '0/3023268' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'standby_1'; 2022-01-13 19:50:39.928 PST [5924] 001_stream_rep.pl LOG: statement: SELECT '0/3023268' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'standby_1'; Without PG_TEST_USE_UNIX_SOCKETS=1 and without the fix the 001_stream_rep hangs too (but on iterations 3, 8, 2). So it seems that using unix sockets increases the fail rate. With the fix 100 iterations with PG_TEST_USE_UNIX_SOCKETS=1 and 40 (and still counting) iterations without PG_TEST_USE_UNIX_SOCKETS pass. So the fix looks as absolutely working to me with the tests that we have for now. [1] https://www.postgresql.org/message-id/CALT9ZEHx2%2B9rqAeAANkUXJCsTueQqdx2Tt6ypaig9tozJkWvkQ%40mail.gmail.com
В списке pgsql-hackers по дате отправления: