Discussion: snapper and skink and fairywren (oh my!)
I was thinking about whether it made sense to try to commit anything and
decided it would be a good idea to check how the buildfarm looks first. It
doesn't look great.

1. snapper failed 4 out of the last 5 runs in recoveryCheck. The latest run
as of this writing shows this:

[19:09:50] t/026_overwrite_contrecord.pl ........ ok 43136 ms
# poll_query_until timed out executing this query:
# SELECT '0/1415E310' <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
# expecting this output:
# t
# last actual query output:
#

2. skink failed the last run in the MiscCheck phase. I had no idea what this
phase was, because the name isn't very descriptive. It seems that it runs
some stuff in contrib and some stuff in src/test/modules, which seems a bit
confusing. Anyway, the failure here is:

test oldest_xmin                  ... FAILED     5533 ms

To find the failure, the regression test suite suggests looking at
contrib/test_decoding/output_iso/regression.diffs or regression.out. But
neither file is in the buildfarm results so far as I can see. It does have
the postmaster log, but I can't tell what's gone wrong from looking at that.
In fact I'm not really sure it's the correct log file, because
oldest_xmin.spec uses a slot called "isolation_slot" and
test_decoding/log/postmaster.log refers only to "regression_slot", so it
seems like this postmaster.log file might cover only the pg_regress tests
and not the results from the isolation tester.

3. fairywren failed the last run in module-commit_tsCheck. It's unhappy
because:

[16:30:02] t/002_standby.pl .... ok 13354 ms ( 0.06 usr 0.00 sys +
1.11 cusr 3.20 csys = 4.37 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/303C7D0'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f

I don't know what is causing any of these failures, and I don't know if
there's already some discussion elsewhere that I've missed, but maybe this
email will be helpful to someone. I also noticed that 2 out of the 3
failures report 2dbb7b9b2279d064f66ce9008869fd0e2b794534 "Fix
pg_hba_file_rules for authentication method cert" as the only new commit
since the last run, and it's hardly believable that that commit would have
broken this. Nor do I see any other recent changes that look like likely
culprits. Apologies in advance if any of this is my fault.

--
Robert Haas
EDB: http://www.enterprisedb.com
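[For context: the "poll_query_until timed out" diagnostics above come from
the TAP framework's poll_query_until() helper, which reruns a query via psql
until it returns the expected output (by default 't') or a timeout expires.
A minimal sketch of the kind of wait these tests perform, assuming the
standard PostgresNode/Cluster TAP helpers; the node variables here are
illustrative, not the actual test code:

    # Sketch only: block until the standby has replayed everything the
    # primary has written so far.  On timeout, poll_query_until() prints
    # the query, the expected output, and the last actual output -- the
    # diagnostics seen in the buildfarm logs above.
    my $lsn = $primary->lsn('write');   # current write LSN on the primary
    $primary->poll_query_until(
        'postgres',
        qq{SELECT '$lsn' <= replay_lsn AND state = 'streaming'
           FROM pg_catalog.pg_stat_replication
           WHERE application_name = 'standby_1';})
      or die "timed out waiting for standby_1 to catch up";

In the real tests a wait of this shape is usually issued through the
wait_for_catchup() wrapper rather than written out by hand.]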
On 1/26/22 16:17, Robert Haas wrote:
> 3. fairywren failed the last run in module-commit_tsCheck. It's unhappy because:
>
> [16:30:02] t/002_standby.pl .... ok 13354 ms ( 0.06 usr 0.00 sys +
> 1.11 cusr 3.20 csys = 4.37 CPU)
> # poll_query_until timed out executing this query:
> # SELECT '0/303C7D0'::pg_lsn <= pg_last_wal_replay_lsn()
> # expecting this output:
> # t
> # last actual query output:
> # f
>
> I don't know what is causing any of these failures, and I don't know
> if there's already some discussion elsewhere that I've missed, but
> maybe this email will be helpful to someone. I also noticed that 2 out
> of the 3 failures report 2dbb7b9b2279d064f66ce9008869fd0e2b794534 "Fix
> pg_hba_file_rules for authentication method cert" as the only new
> commit since the last run, and it's hardly believable that that commit
> would have broken this. Nor do I see any other recent changes that
> look like likely culprits. Apologies in advance if any of this is my
> fault.

Intermittent failures give a false positive against the latest set of
commits. These failures started happening regularly about 49 days ago, so we
need to look at what exactly happened around then. See
<https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=90&stage=module-commit_tsCheck&filter=Submit>

It's very unlikely any of this is your fault. In any case, intermittent
failures are very hard to nail down.

I should add that both drongo and fairywren run on a single
AWS/EC2/Windows2019 instance. I haven't updated the OS in a while, so I'm
doing that to see if matters improve.

FWIW, I just reproduced the error on a similar instance where I'm testing
some stuff for Andres' project to make TAP tests more portable.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
At Wed, 26 Jan 2022 18:45:53 -0500, Andrew Dunstan <andrew@dunslane.net> wrote in
> It's very unlikely any of this is your fault. In any case, intermittent
> failures are very hard to nail down.

The primary starts at 2022-01-26 16:30:06.613 and accepts a connection from
the standby at 2022-01-26 16:30:09.911.

P: 2022-01-26 16:30:06.613 UTC [74668:1] LOG:  starting PostgreSQL 15devel on x86_64-w64-mingw32, compiled by gcc.exe (Rev2,Built by MSYS2 project) 10.3.0, 64-bit
S: 2022-01-26 16:30:09.637 UTC [72728:1] LOG:  starting PostgreSQL 15devel on x86_64-w64-mingw32, compiled by gcc.exe (Rev2,Built by MSYS2 project) 10.3.0, 64-bit
P: 2022-01-26 16:30:09.911 UTC [74096:3] [unknown] LOG:  replication connection authorized: user=pgrunner application_name=standby
S: 2022-01-26 16:30:09.912 UTC [73932:1] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1

After this, the primary restarts.

P: 2022-01-26 16:30:11.832 UTC [74668:7] LOG:  database system is shut down
P: 2022-01-26 16:30:12.010 UTC [72140:1] LOG:  starting PostgreSQL 15devel on x86_64-w64-mingw32, compiled by gcc.exe (Rev2,Built by MSYS2 project) 10.3.0, 64-bit

But the standby doesn't notice the disconnection, and the test continues
with poll_query_until() on the stale connection. The query itself was
eventually executed after a reconnection, though. The following log lines
are consecutive; nothing is omitted between them.

S: 2022-01-26 16:30:09.912 UTC [73932:1] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
S: 2022-01-26 16:30:12.825 UTC [72760:1] [unknown] LOG:  connection received: host=127.0.0.1 port=60769
S: 2022-01-26 16:30:12.830 UTC [72760:2] [unknown] LOG:  connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" method=sspi (C:/tools/msys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_003_standby_2_standby_data/pgdata/pg_hba.conf:2)
S: 2022-01-26 16:30:12.830 UTC [72760:3] [unknown] LOG:  connection authorized: user=pgrunner database=postgres application_name=003_standby_2.pl
S: 2022-01-26 16:30:12.838 UTC [72760:4] 003_standby_2.pl LOG:  statement: SELECT '0/303C7D0'::pg_lsn <= pg_last_wal_replay_lsn()

Since the standby doesn't receive any WAL from the restarted server,
poll_query_until() never returns. I'm not sure why the standby's walreceiver
continued running without noticing that the primary had been dead for such a
long time.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
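[A side note on mitigation, sketched under assumptions; this is not the fix
that was actually adopted. PostgreSQL's wal_receiver_timeout setting makes a
standby terminate its walreceiver when the primary has gone silent for too
long, forcing a reconnect to the restarted primary so that replay can
resume. A test could lower it from the 60-second default like this, assuming
the standard TAP helpers; the $standby variable is illustrative:

    # Sketch only: have the standby give up on a silent primary after
    # 5 seconds instead of the default 60, so its walreceiver exits and
    # reconnects, letting pg_last_wal_replay_lsn() advance again.
    $standby->append_conf('postgresql.conf',
        "wal_receiver_timeout = '5s'");
    $standby->restart;
]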
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> I'm not sure why the standby's walreceiver continued running without
> noticing that the primary had been dead for such a long time.

Isn't this precisely the problem that made us revert the "graceful
disconnection" patch [1]? Munro seems to have a theory about why that broke
things, but in any case it's not Haas' fault ;-)

			regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BOeoETZQ%3DQw5Ub5h3tmwQhBmDA%3DnuNO3KG%3DzWfUypFAw%40mail.gmail.com
On Wed, Jan 26, 2022 at 04:17:19PM -0500, Robert Haas wrote:
> 1. snapper failed 4 out of the last 5 runs in recoveryCheck. The
> latest run as of this writing shows this:
>
> [19:09:50] t/026_overwrite_contrecord.pl ........ ok 43136 ms
> # poll_query_until timed out executing this query:
> # SELECT '0/1415E310' <= replay_lsn AND state = 'streaming' FROM
> pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
> # expecting this output:
> # t
> # last actual query output:
> #

I expect commit ce6d793 will end the failures on snapper, kittiwake, and
tadarida. I apologize for letting the damage last for months.
On 1/26/22 18:45, Andrew Dunstan wrote:
> On 1/26/22 16:17, Robert Haas wrote:
>> 3. fairywren failed the last run in module-commit_tsCheck. It's unhappy because:
>>
> Intermittent failures give a false positive against the latest set of
> commits. These failures started happening regularly about 49 days ago.
> So we need to look at what exactly happened around then. See
> <https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=90&stage=module-commit_tsCheck&filter=Submit>

Specifically, the issue is discussed here:
<https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BOeoETZQ%3DQw5Ub5h3tmwQhBmDA%3DnuNO3KG%3DzWfUypFAw%40mail.gmail.com>

Tom has already reverted the offending changes on the back branches.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com