Thread: snapper and skink and fairywren (oh my!)


snapper and skink and fairywren (oh my!)

From: Robert Haas
Date:
I was thinking about whether it made sense to try to commit anything
and decided it would be a good idea to check how the buildfarm looks
first. It doesn't look great.

1. snapper failed 4 out of the last 5 runs in recoveryCheck. The
latest run as of this writing shows this:

[19:09:50] t/026_overwrite_contrecord.pl ........ ok    43136 ms
# poll_query_until timed out executing this query:
# SELECT '0/1415E310' <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
# expecting this output:
# t
# last actual query output:
#
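For readers unfamiliar with the TAP helper: poll_query_until re-runs the query until it returns the expected output or a timeout expires, and the diagnostics above are what it prints on timeout. A minimal sketch of that retry loop, in Python rather than the helper's actual Perl, with illustrative names only:

```python
import time

def poll_until(check, timeout=180.0, interval=0.1):
    """Retry check() until it returns a true value or the timeout elapses.

    Mirrors the shape of poll_query_until: on timeout it gives up and
    reports failure, which is what produces the "timed out executing
    this query" diagnostics in the buildfarm log above.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False

# Toy stand-in for the replay-LSN query: succeeds on the third attempt.
attempts = {"n": 0}
def fake_query():
    attempts["n"] += 1
    return attempts["n"] >= 3

print(poll_until(fake_query, timeout=5.0, interval=0.01))  # True
```

In the failing runs, the query never starts returning `t`, so the loop runs to its deadline and the test reports the last (empty or `f`) output.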

2. skink failed the last run in the MiscCheck phase. I had no idea
what this phase was, because the name isn't very descriptive. It seems
that it runs some stuff in contrib and some stuff in src/test/modules,
which seems a bit confusing. Anyway, the failure here is:

test oldest_xmin                  ... FAILED     5533 ms

To find the failure, the regression test suite suggests looking at
file contrib/test_decoding/output_iso/regression.diffs or
regression.out. But neither file is in the buildfarm results so far as
I can see. It does have the postmaster log but I can't tell what's
gone wrong from looking at that. In fact I'm not really sure it's the
correct log file, because oldest_xmin.spec uses a slot called
"isolation_slot" and test_decoding/log/postmaster.log refers only to
"regression_slot", so it seems like this postmaster.log file might
cover only the pg_regress tests and not the results from the isolation
tester.

3. fairywren failed the last run in module-commit_tsCheck. It's unhappy because:

[16:30:02] t/002_standby.pl .... ok    13354 ms ( 0.06 usr  0.00 sys +
 1.11 cusr  3.20 csys =  4.37 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/303C7D0'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f
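As background on what that query compares: a pg_lsn such as '0/303C7D0' is two 32-bit hex halves, and ordering is ordinary 64-bit integer comparison after combining them. A small illustrative sketch (the standby's replay position below is made up for the example):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string like '0/303C7D0' into a 64-bit integer.

    The format is the high 32 bits and low 32 bits in hex, separated
    by '/'.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

target = lsn_to_int("0/303C7D0")   # the LSN the test is waiting for
replay = lsn_to_int("0/3000000")   # hypothetical stale replay position
print(replay >= target)            # False: the query keeps returning 'f'
```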

I don't know what is causing any of these failures, and I don't know
if there's already some discussion elsewhere that I've missed, but
maybe this email will be helpful to someone. I also noticed that 2 out
of the 3 failures report 2dbb7b9b2279d064f66ce9008869fd0e2b794534 "Fix
pg_hba_file_rules for authentication method cert" as the only new
commit since the last run, and it's hardly believable that that commit
would have broken this. Nor do I see any other recent changes that
look like likely culprits. Apologies in advance if any of this is my
fault.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: snapper and skink and fairywren (oh my!)

From: Andrew Dunstan
Date:
On 1/26/22 16:17, Robert Haas wrote:
> 3. fairywren failed the last run in module-commit_tsCheck. It's unhappy because:
>
> [16:30:02] t/002_standby.pl .... ok    13354 ms ( 0.06 usr  0.00 sys +
>  1.11 cusr  3.20 csys =  4.37 CPU)
> # poll_query_until timed out executing this query:
> # SELECT '0/303C7D0'::pg_lsn <= pg_last_wal_replay_lsn()
> # expecting this output:
> # t
> # last actual query output:
> # f
>
> I don't know what is causing any of these failures, and I don't know
> if there's already some discussion elsewhere that I've missed, but
> maybe this email will be helpful to someone. I also noticed that 2 out
> of the 3 failures report 2dbb7b9b2279d064f66ce9008869fd0e2b794534 "Fix
> pg_hba_file_rules for authentication method cert" as the only new
> commit since the last run, and it's hardly believable that that commit
> would have broken this. Nor do I see any other recent changes that
> look like likely culprits. Apologies in advance if any of this is my
> fault.
>

Intermittent failures tend to cast false suspicion on the latest set
of commits. These failures started happening regularly about 49 days
ago.
So we need to look at what exactly happened around then. See
<https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=90&stage=module-commit_tsCheck&filter=Submit>

It's very unlikely any of this is your fault. In any case, intermittent
failures are very hard to nail down.

I should add that both drongo and fairywren run on a single
AWS/EC2/Windows2019 instance. I haven't updated the OS in a while, so
I'm doing that to see whether matters improve. FWIW, I just reproduced
the error on a similar instance where I'm testing some stuff for
Andres' project to make TAP tests more portable.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: snapper and skink and fairywren (oh my!)

From: Kyotaro Horiguchi
Date:
At Wed, 26 Jan 2022 18:45:53 -0500, Andrew Dunstan <andrew@dunslane.net> wrote in 
> It's very unlikely any of this is your fault. In any case, intermittent
> failures are very hard to nail down.

The primary started at 2022-01-26 16:30:06.613, then accepted a
connection from the standby at 2022-01-26 16:30:09.911.

P: 2022-01-26 16:30:06.613 UTC [74668:1] LOG:  starting PostgreSQL 15devel on x86_64-w64-mingw32, compiled by gcc.exe
(Rev2,Built by MSYS2 project) 10.3.0, 64-bit
 
S: 2022-01-26 16:30:09.637 UTC [72728:1] LOG:  starting PostgreSQL 15devel on x86_64-w64-mingw32, compiled by gcc.exe
(Rev2,Built by MSYS2 project) 10.3.0, 64-bit
 
P: 2022-01-26 16:30:09.911 UTC [74096:3] [unknown] LOG:  replication connection authorized: user=pgrunner
application_name=standby
S: 2022-01-26 16:30:09.912 UTC [73932:1] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1

After this the primary restarts.

P: 2022-01-26 16:30:11.832 UTC [74668:7] LOG:  database system is shut down
P: 2022-01-26 16:30:12.010 UTC [72140:1] LOG:  starting PostgreSQL 15devel on x86_64-w64-mingw32, compiled by gcc.exe
(Rev2,Built by MSYS2 project) 10.3.0, 64-bit
 

But the standby doesn't notice the disconnection and continues with
poll_query_until() on the stale connection. Still, the query should
eventually have executed after reconnection. The following log lines
are shown in full, not thinned out.

S: 2022-01-26 16:30:09.912 UTC [73932:1] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
S: 2022-01-26 16:30:12.825 UTC [72760:1] [unknown] LOG:  connection received: host=127.0.0.1 port=60769
S: 2022-01-26 16:30:12.830 UTC [72760:2] [unknown] LOG:  connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner"
method=sspi
(C:/tools/msys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_003_standby_2_standby_data/pgdata/pg_hba.conf:2)
S: 2022-01-26 16:30:12.830 UTC [72760:3] [unknown] LOG:  connection authorized: user=pgrunner database=postgres
application_name=003_standby_2.pl
S: 2022-01-26 16:30:12.838 UTC [72760:4] 003_standby_2.pl LOG:  statement: SELECT '0/303C7D0'::pg_lsn <=
pg_last_wal_replay_lsn()

Since the standby doesn't receive WAL from the restarted server,
poll_query_until() doesn't return.

I'm not sure why the walsender of the standby continued running
without noticing that the primary had been dead for such a long time.
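This is the classic half-open TCP problem: if the peer vanishes without a clean close, a blocking read never learns about it unless some timeout fires, which is the role wal_receiver_timeout plays in a real walreceiver. A generic sketch of that defensive pattern, not actual PostgreSQL code:

```python
import socket

def recv_or_give_up(sock: socket.socket, timeout_s: float) -> bytes:
    """Read from a replication-style connection, but treat prolonged
    silence as a dead peer instead of blocking forever on a half-open
    socket (the failure mode described in the log analysis above)."""
    sock.settimeout(timeout_s)
    try:
        data = sock.recv(8192)
    except socket.timeout:
        raise ConnectionError("peer silent past timeout; assuming it died")
    if not data:
        raise ConnectionError("peer closed the connection")
    return data
```

Without such a timeout, the reader only discovers the dead peer when the OS eventually aborts the connection, which can take far longer than a test's poll deadline.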

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: snapper and skink and fairywren (oh my!)

From: Tom Lane
Date:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> I'm not sure why walsender of the standby continued running not knowing the primary has been once dead for such a long time.

Isn't this precisely the problem that made us revert the "graceful
disconnection" patch [1]?  Munro seems to have a theory about
why that broke things, but in any case it's not Haas' fault ;-)

            regards, tom lane

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BOeoETZQ%3DQw5Ub5h3tmwQhBmDA%3DnuNO3KG%3DzWfUypFAw%40mail.gmail.com



Re: snapper and skink and fairywren (oh my!)

From: Noah Misch
Date:
On Wed, Jan 26, 2022 at 04:17:19PM -0500, Robert Haas wrote:
> 1. snapper failed 4 out of the last 5 runs in recoveryCheck. The
> latest run as of this writing shows this:
> 
> [19:09:50] t/026_overwrite_contrecord.pl ........ ok    43136 ms
> # poll_query_until timed out executing this query:
> # SELECT '0/1415E310' <= replay_lsn AND state = 'streaming' FROM
> pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
> # expecting this output:
> # t
> # last actual query output:
> #

I expect commit ce6d793 will end the failures on snapper, kittiwake, and
tadarida.  I apologize for letting the damage last months.



Re: snapper and skink and fairywren (oh my!)

От
Andrew Dunstan
Дата:
On 1/26/22 18:45, Andrew Dunstan wrote:
> On 1/26/22 16:17, Robert Haas wrote
>> 3. fairywren failed the last run in module-commit_tsCheck. It's unhappy because:
>>
> Intermittent failures give a false positive against the latest set of
> commits. These failures started happening regularly about 49 days ago.
> So we need to look at what exactly happened around then. See
> <https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=90&stage=module-commit_tsCheck&filter=Submit>


Specifically, the issue is discussed here

<https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BOeoETZQ%3DQw5Ub5h3tmwQhBmDA%3DnuNO3KG%3DzWfUypFAw%40mail.gmail.com>
Tom has already reverted the offending changes on the back branches.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com