Thread: Race conditions in 019_replslot_limit.pl


Race conditions in 019_replslot_limit.pl

From:
Heikki Linnakangas
Date:
While looking at recent failures in the new 028_pitr_timelines.pl 
recovery test, I noticed that there have been a few failures in the 
buildfarm in the recoveryCheck phase even before that, in the 
019_replslot_limit.pl test.

For example:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04

[07:42:23] t/018_wal_optimize.pl ................ ok    12403 ms ( 0.00 
usr  0.00 sys +  1.40 cusr  0.63 csys =  2.03 CPU)
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# lost
# last actual query output:
# unreserved

and:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08

#   Failed test 'have walsender pid 3682154
# 3682136'
#   at t/019_replslot_limit.pl line 335.
#                   '3682154
# 3682136'
#     doesn't match '(?^:^[0-9]+$)'

The latter looks like there are two walsenders active, which confuses 
the test. Not sure what's happening in the first case, but looks like 
some kind of a race condition at a quick glance.

Has anyone looked into these yet?

- Heikki



Re: Race conditions in 019_replslot_limit.pl

From:
Andres Freund
Date:
Hi,

On 2022-02-15 23:29:20 +0200, Heikki Linnakangas wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04
> 
> [07:42:23] t/018_wal_optimize.pl ................ ok    12403 ms ( 0.00 usr
> 0.00 sys +  1.40 cusr  0.63 csys =  2.03 CPU)
> # poll_query_until timed out executing this query:
> # SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
> # expecting this output:
> # lost
> # last actual query output:
> # unreserved

The relevant bit from the log is:
2022-02-14 07:42:27.817 CET [6209f9d3.68bab:3] LOG:  database system is ready to accept read-only connections
2022-02-14 07:42:27.819 CET [6209f9d3.68bb7:1] LOG:  started streaming WAL from primary at 0/1B00000 on timeline 1
2022-02-14 07:42:27.819 CET [6209f9d3.68bb7:2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000001B has already been removed

2022-02-14 07:42:27.822 CET [6209f9d3.68bb9:1] LOG:  started streaming WAL from primary at 0/1B00000 on timeline 1
2022-02-14 07:42:27.822 CET [6209f9d3.68bb9:2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000001B has already been removed

I think what happened is that there was no WAL to receive between the start of
the primary and the $node_primary3->wait_for_catchup($node_standby3);

Because the slot is created without reserving WAL that allows the primary to
remove the WAL segments without ever creating a slot based conflict. I think
that should be fixable by reserving the slot at creation time?
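
Concretely, that would mean something like this in the test (a minimal sketch only;
the call already exists in 019_replslot_limit.pl, just without the second argument):

    # Sketch: pass true as the second argument (immediately_reserve) so the
    # slot holds back WAL from the moment it is created, rather than only
    # once the standby has connected.
    $node_primary3->safe_psql('postgres',
        "SELECT pg_create_physical_replication_slot('rep3', true)");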


> and:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
> 
> #   Failed test 'have walsender pid 3682154
> # 3682136'
> #   at t/019_replslot_limit.pl line 335.
> #                   '3682154
> # 3682136'
> #     doesn't match '(?^:^[0-9]+$)'
> 
> The latter looks like there are two walsenders active, which confuses the
> test.

Too bad there's no plain pid in the log_line_prefix. Looks like that used to be the
buildfarm default, and I haven't fixed that animal's configuration...

%c apparently is hex(process startup time).hex(pid) in hex, so we're looking
for 382f58... Confirmed by the slot name:
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
which pg_basebackup builds using the backend pid:
        replication_slot = psprintf("pg_basebackup_%d", (int) PQbackendPID(param->bgconn));

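A quick sanity check on that decoding (just an illustration, not part of the test):

    # %c is <hex backend start time>.<hex pid>; decoding the second half
    # recovers the plain pid that also shows up in the slot name above.
    printf("%d\n", hex("382f58"));    # prints 3682136
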
The logs for that pid are:
2022-02-15 12:10:20.873 CET [620b8a1c.382f58:1] LOG:  connection received: host=[local]
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:2] LOG:  replication connection authorized: user=bf application_name=019_replslot_limit.pl
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:3] LOG:  received replication command: SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:4] STATEMENT:  SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:6] STATEMENT:  CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:7] LOG:  received replication command: IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:8] STATEMENT:  IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:9] LOG:  received replication command: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:10] STATEMENT:  START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1

Even though the node has log_disconnect = true, and other processes indeed log
their disconnection, there's no disconnect for the above session until the
server is shut down.  Even though pg_basebackup clearly finished? Uh, huh?

I guess it's conceivable that the backend was still working through process
shutdown? But it doesn't seem too likely, given that several other connections
manage to get through entire connect / disconnect cycles?



> Has anyone looked into these yet?

Hadn't yet...

Greetings,

Andres Freund



Re: Race conditions in 019_replslot_limit.pl

From:
Kyotaro Horiguchi
Date:
At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres@anarazel.de> wrote in 
> I think what happened is that there was no WAL to receive between the start of
> the primary and the $node_primary3->wait_for_catchup($node_standby3);
> 
> Because the slot is created without reserving WAL that allows the primary to
> remove the WAL segments without ever creating a slot based conflict. I think
> that should be fixable by reserving the slot at creation time?

Agreed.  Doing this at all slot creations seems fine.

> > and:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
> > 
> > #   Failed test 'have walsender pid 3682154
> > # 3682136'
> > #   at t/019_replslot_limit.pl line 335.
> > #                   '3682154
> > # 3682136'
> > #     doesn't match '(?^:^[0-9]+$)'
> > 
> > The latter looks like there are two walsenders active, which confuses the
> > test.
> 
> Too bad there's no plain pid in the log_line_prefix. Looks like that used to be the
> buildfarm default, and I haven't fixed that animal's configuration...
> 
> %c apparently is hex(process startup time).hex(pid) in hex, so we're looking
> for 382f58... Confirmed by the slot name:
> 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
> which pg_basebackup builds using the backend pid:
>         replication_slot = psprintf("pg_basebackup_%d", (int) PQbackendPID(param->bgconn));
> 
> The logs for that pid are:
> 2022-02-15 12:10:20.873 CET [620b8a1c.382f58:1] LOG:  connection received: host=[local]
> 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:2] LOG:  replication connection authorized: user=bf application_name=019_replslot_limit.pl
> 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:3] LOG:  received replication command: SHOW data_directory_mode
> 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:4] STATEMENT:  SHOW data_directory_mode
> 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
> 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:6] STATEMENT:  CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
> 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:7] LOG:  received replication command: IDENTIFY_SYSTEM
> 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:8] STATEMENT:  IDENTIFY_SYSTEM
> 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:9] LOG:  received replication command: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
> 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:10] STATEMENT:  START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
> 
> Even though the node has log_disconnect = true, and other processes indeed log
> their disconnection, there's no disconnect for the above session until the
> server is shut down.  Even though pg_basebackup clearly finished? Uh, huh?

It seems to me so, too.

> I guess it's conceivable that the backend was still working through process
> shutdown? But it doesn't seem too likely, given that several other connections
> manage to get through entire connect / disconnect cycles?

Yes, but since postmaster seems thinking that process is gone.


> > Has anyone looked into these yet?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race conditions in 019_replslot_limit.pl

From:
Masahiko Sawada
Date:
On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres@anarazel.de> wrote in
> > I think what happened is that there was no WAL to receive between the start of
> > the primary and the $node_primary3->wait_for_catchup($node_standby3);
> >
> > Because the slot is created without reserving WAL that allows the primary to
> > remove the WAL segments without ever creating a slot based conflict. I think
> > that should be fixable by reserving the slot at creation time?
>
> Agreed.  Doing this at all slot creations seems fine.
>
> > > and:
> > >
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
> > >
> > > #   Failed test 'have walsender pid 3682154
> > > # 3682136'
> > > #   at t/019_replslot_limit.pl line 335.
> > > #                   '3682154
> > > # 3682136'
> > > #     doesn't match '(?^:^[0-9]+$)'
> > >
> > > The latter looks like there are two walsenders active, which confuses the
> > > test.
> >
> > Too bad there's no plain pid in the log_line_prefix. Looks like that used to be the
> > buildfarm default, and I haven't fixed that animal's configuration...
> >
> > %c apparently is hex(process startup time).hex(pid) in hex, so we're looking
> > for 382f58... Confirmed by the slot name:
> > 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
> > which pg_basebackup builds using the backend pid:
> >               replication_slot = psprintf("pg_basebackup_%d", (int) PQbackendPID(param->bgconn));
> >
> > The logs for that pid are:
> > 2022-02-15 12:10:20.873 CET [620b8a1c.382f58:1] LOG:  connection received: host=[local]
> > 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:2] LOG:  replication connection authorized: user=bf application_name=019_replslot_limit.pl
> > 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:3] LOG:  received replication command: SHOW data_directory_mode
> > 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:4] STATEMENT:  SHOW data_directory_mode
> > 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
> > 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:6] STATEMENT:  CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
> > 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:7] LOG:  received replication command: IDENTIFY_SYSTEM
> > 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:8] STATEMENT:  IDENTIFY_SYSTEM
> > 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:9] LOG:  received replication command: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
> > 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:10] STATEMENT:  START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
> >
> > Even though the node has log_disconnect = true, and other processes indeed log
> > their disconnection, there's no disconnect for the above session until the
> > server is shut down.  Even though pg_basebackup clearly finished? Uh, huh?
>
> It seems to me so, too.
>
> > I guess it's conceivable that the backend was still working through process
> > shutdown? But it doesn't seem too likely, given that several other connections
> > manage to get through entire connect / disconnect cycles?
>
> Yes, but since postmaster seems thinking that process is gone.

Or it's possible that the process took a time to clean up the
temporary replication slot?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Race conditions in 019_replslot_limit.pl

From:
Kyotaro Horiguchi
Date:
At Wed, 16 Feb 2022 14:26:37 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Agreed.  Doing this at all slot creations seems fine.

Done in the attached. The first slot is deliberately created
unreserved so I changed the code to re-create the slot with "reserved"
before taking backup.

> > Even though the node has log_disconnect = true, and other processes indeed log
> > their disconnection, there's no disconnect for the above session until the
> > server is shut down.  Even though pg_basebackup clearly finished? Uh, huh?
> 
> It seems to me so, too.
> 
> > I guess it's conceivable that the backend was still working through process
> > shutdown? But it doesn't seem too likely, given that several other connections
> > manage to get through entire connect / disconnect cycles?
> 
> Yes, but since postmaster seems thinking that process is gone.

s/ since//;

Whatever is happening at that time, I can make sure that walsender is
gone before making a new replication connection, even though it
doesn't "fix" any of the observed issues.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 4257bd4d35..059003d63a 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -35,6 +35,11 @@ my $result = $node_primary->safe_psql('postgres',
 );
 is($result, "t|t|t", 'check the state of non-reserved slot is "unknown"');
 
+# re-create reserved replication slot before taking backup
+$node_primary->safe_psql('postgres', q[
+    SELECT pg_drop_replication_slot('rep1');
+    SELECT pg_create_physical_replication_slot('rep1', true);
+]);
 
 # Take backup
 my $backup_name = 'my_backup';
@@ -265,7 +270,7 @@ log_checkpoints = yes
 ));
 $node_primary2->start;
 $node_primary2->safe_psql('postgres',
-    "SELECT pg_create_physical_replication_slot('rep1')");
+    "SELECT pg_create_physical_replication_slot('rep1', true)");
 $backup_name = 'my_backup2';
 $node_primary2->backup($backup_name);
 
@@ -319,7 +324,7 @@ $node_primary3->append_conf(
     ));
 $node_primary3->start;
 $node_primary3->safe_psql('postgres',
-    "SELECT pg_create_physical_replication_slot('rep3')");
+    "SELECT pg_create_physical_replication_slot('rep3', true)");
 # Take backup
 $backup_name = 'my_backup';
 $node_primary3->backup($backup_name);
@@ -327,6 +332,14 @@ $node_primary3->backup($backup_name);
 my $node_standby3 = PostgreSQL::Test::Cluster->new('standby_3');
 $node_standby3->init_from_backup($node_primary3, $backup_name,
     has_streaming => 1);
+
+# We will check for the walsender process just after this. Make sure no
+# walsender is still lingering.
+$node_primary3->poll_query_until('postgres',
+    "SELECT count(*) = 0 FROM pg_stat_activity WHERE backend_type = 'walsender'",
+    "t")
+  or die "timed out waiting for walsender to exit";
+
 $node_standby3->append_conf('postgresql.conf', "primary_slot_name = 'rep3'");
 $node_standby3->start;
 $node_primary3->wait_for_catchup($node_standby3);

Re: Race conditions in 019_replslot_limit.pl

From:
Kyotaro Horiguchi
Date:
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in 
> On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres@anarazel.de> wrote in
> > > I guess it's conceivable that the backend was still working through process
> > > shutdown? But it doesn't seem too likely, given that several other connections
> > > manage to get through entire connect / disconnect cycles?
> >
> > Yes, but since postmaster seems thinking that process is gone.
> 
> Or it's possible that the process took a time to clean up the
> temporary replication slot?

Ugh, it's an immediate shutdown.  So postmaster kills the walsender no
matter what state the walsender is in, and leaves no log about the
end of the walsender.

Sorry for the confusion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race conditions in 019_replslot_limit.pl

From:
Kyotaro Horiguchi
Date:
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in 
> Or it's possible that the process took a time to clean up the
> temporary replication slot?

Checkpointer may take ReplicationSlotControlLock. Deadlock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race conditions in 019_replslot_limit.pl

From:
Kyotaro Horiguchi
Date:
At Wed, 16 Feb 2022 15:22:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in 
> > Or it's possible that the process took a time to clean up the
> > temporary replication slot?
> 
> Checkpointer may take ReplicationSlotControlLock. Deadlock between
> ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
> happened?

Or a missing CV broadcast?  Anyway, I haven't found a scenario of
interference between checkpointer and walsender.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race conditions in 019_replslot_limit.pl

From:
Masahiko Sawada
Date:
On Wed, Feb 16, 2022 at 3:22 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
> > Or it's possible that the process took a time to clean up the
> > temporary replication slot?
>
> Checkpointer may take ReplicationSlotControlLock. Deadlock between
> ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
> happened?

That's possible. Whatever the exact cause of this failure, I think we
can stabilize this test by adding an application_name condition to
the query.
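
For example (a hypothetical sketch; it assumes the standby's walsender keeps
application_name=standby_3, as the log excerpts later in this thread show):

    # Only pick up the walsender serving the standby, so that a lingering
    # pg_basebackup walsender can no longer confuse the pid check.
    my $senderpid = $node_primary3->safe_psql('postgres',
        "SELECT pid FROM pg_stat_activity "
          . "WHERE backend_type = 'walsender' AND application_name = 'standby_3'");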

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Race conditions in 019_replslot_limit.pl

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I think the test is telling us that something may be broken. We shouldn't
> silence that without at least some understanding what it is.

I looked at the recent failure on komodoensis [1], and I think what is
happening is just that the walsender for the basebackup run (launched
at 019_replslot_limit.pl line 325) hasn't exited yet at the point where
we do a blind
"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'"
and expect that we're only going to see the walsender launched for
the standby at line 331.  The two PIDs reported in the failure
correspond to this postmaster log trace:

2022-02-16 23:06:29.596 CET [620d7565.38dd62:1] LOG:  connection received: host=[local]
2022-02-16 23:06:29.596 CET [620d7565.38dd62:2] LOG:  replication connection authorized: user=bf application_name=019_replslot_limit.pl
2022-02-16 23:06:29.596 CET [620d7565.38dd62:3] LOG:  received replication command: SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:4] STATEMENT:  SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:5] LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-16 23:06:29.596 CET [620d7565.38dd62:6] STATEMENT:  CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-16 23:06:29.597 CET [620d7565.38dd62:7] LOG:  received replication command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:8] STATEMENT:  IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:9] LOG:  received replication command: START_REPLICATION SLOT "pg_basebackup_3726690" 0/600000 TIMELINE 1
2022-02-16 23:06:29.597 CET [620d7565.38dd62:10] STATEMENT:  START_REPLICATION SLOT "pg_basebackup_3726690" 0/600000 TIMELINE 1

and this one:

2022-02-16 23:06:29.687 CET [620d7565.38dd6f:1] LOG:  connection received: host=[local]
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:2] LOG:  replication connection authorized: user=bf application_name=standby_3
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:3] LOG:  received replication command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:4] STATEMENT:  IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:5] LOG:  received replication command: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:6] STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1

There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...
shouldn't we fix that?  But in any case, I don't see anything
interesting here, just a query that needs to be more selective.
Perhaps we can look for application_name=standby_3?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2022-02-16%2021%3A16%3A04



Re: Race conditions in 019_replslot_limit.pl

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-02-16 20:22:23 -0500, Tom Lane wrote:
>> There's no disconnection log entry for either, which I suppose means
>> that somebody didn't bother logging disconnection for walsenders ...

> The thing is, we actually *do* log disconnection for walsenders:

Ah, my mistake, now I do see a disconnection entry for the other walsender
launched by the basebackup.

> Starting a node in recovery and having it connect to the primary seems like a
> mighty long time for a process to exit, unless it's stuck behind something.

Fair point.  Also, 019_replslot_limit.pl hasn't been changed in any
material way in months, but *something's* changed recently, because
this just started.  I scraped the buildfarm for instances of
"Failed test 'have walsender pid" going back 6 months, and what I find is

   sysname    | branch |      snapshot       |     stage     |                      l
--------------+--------+---------------------+---------------+---------------------------------------------
 desmoxytes   | HEAD   | 2022-02-15 04:42:05 | recoveryCheck | #   Failed test 'have walsender pid 1685516
 idiacanthus  | HEAD   | 2022-02-15 07:24:05 | recoveryCheck | #   Failed test 'have walsender pid 2758549
 serinus      | HEAD   | 2022-02-15 11:00:08 | recoveryCheck | #   Failed test 'have walsender pid 3682154
 desmoxytes   | HEAD   | 2022-02-15 11:04:05 | recoveryCheck | #   Failed test 'have walsender pid 3775359
 flaviventris | HEAD   | 2022-02-15 18:03:48 | recoveryCheck | #   Failed test 'have walsender pid 1517077
 idiacanthus  | HEAD   | 2022-02-15 22:48:05 | recoveryCheck | #   Failed test 'have walsender pid 2494972
 desmoxytes   | HEAD   | 2022-02-15 23:48:04 | recoveryCheck | #   Failed test 'have walsender pid 3055399
 desmoxytes   | HEAD   | 2022-02-16 10:48:05 | recoveryCheck | #   Failed test 'have walsender pid 1593461
 komodoensis  | HEAD   | 2022-02-16 21:16:04 | recoveryCheck | #   Failed test 'have walsender pid 3726703
 serinus      | HEAD   | 2022-02-17 01:18:17 | recoveryCheck | #   Failed test 'have walsender pid 208363

So (a) it broke around 48 hours ago, which is already a useful
bit of info, and (b) your animals seem far more susceptible than
anyone else's.  Why do you suppose that is?

            regards, tom lane



Re: Race conditions in 019_replslot_limit.pl

From:
Tom Lane
Date:
I wrote:
> So (a) it broke around 48 hours ago, which is already a useful
> bit of info, and (b) your animals seem far more susceptible than
> anyone else's.  Why do you suppose that is?

Eyeballing the commit log for potential causes in that time frame,
I can't help noticing

2 days ago    Andres Freund    Move replication slot release to before_shmem_exit().

            regards, tom lane



Re: Race conditions in 019_replslot_limit.pl

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> But right now I'm not seeing what prevents us from throwing a FATAL error
> while holding an lwlock?

If we're inside a transaction, then ShutdownPostgres ->
AbortOutOfAnyTransaction would release LWLocks.  I wonder
though if an exiting walsender would always take that path.
Maybe it'd be good if we released all LWLocks even if we
don't think we're in a transaction?

More generally, it wasn't obvious to me why you thought that
BaseInit was a good spot to place the on_shmem_exit call.
It obviously can't be *before* that because of the pgstats
dependency, but maybe it should be later?

            regards, tom lane



Re: Race conditions in 019_replslot_limit.pl

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> There've not been any new reports in the last 18 hours, but that's probably
> just due to lower commit activity triggering fewer runs.

myna just showed the same symptom:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=myna&dt=2022-02-18%2003%3A00%3A17

#   Failed test 'have walsender pid 13681
# 13670'
#   at t/019_replslot_limit.pl line 335.
#                   '13681
# 13670'
#     doesn't match '(?^:^[0-9]+$)'
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 18.
[03:13:09] t/019_replslot_limit.pl ..............

I think that's the first one on something outside your menagerie.

            regards, tom lane



Re: Race conditions in 019_replslot_limit.pl

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-02-17 21:55:21 -0800, Andres Freund wrote:
>> Isn't it pretty bonkers that we allow error processing to get stuck behind
>> network traffic, *before* we have released resources (locks etc)?

It's more or less intentional, per elog.c:

    /*
     * This flush is normally not necessary, since postgres.c will flush out
     * waiting data when control returns to the main loop. But it seems best
     * to leave it here, so that the client has some clue what happened if the
     * backend dies before getting back to the main loop ... error/notice
     * messages should not be a performance-critical path anyway, so an extra
     * flush won't hurt much ...
     */
    pq_flush();

Perhaps it'd be sensible to do this only in debugging (ie Assert)
builds?

            regards, tom lane



Re: Race conditions in 019_replslot_limit.pl

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-02-18 18:15:21 -0500, Tom Lane wrote:
>> Perhaps it'd be sensible to do this only in debugging (ie Assert)
>> builds?

> That seems not great, because it pretty clearly can lead to hangs, which is
> problematic in tests too. What about using pq_flush_if_writable()? In nearly
> all situations that'd still push the failure to the client.

That'd be okay by me.

> We'd also need to add pq_endmessage_noblock(), because the pq_endmessage()
> obviously tries to send (as in the backtrace upthread) if the output buffer is
> large enough, which it often will be in walsender.

I don't see that as "obvious".  If we're there, we do not have an
error situation.

> I guess we could try to flush in a blocking manner sometime later in the
> shutdown sequence, after we've released resources? But I'm doubtful it's a
> good idea, we don't really want to block waiting to exit when e.g. the network
> connection is dead without the TCP stack knowing.

I think you are trying to move in the direction of possibly exiting
without ever sending at all, which does NOT seem like an improvement.

            regards, tom lane



Re: Race conditions in 019_replslot_limit.pl

From:
Andres Freund
Date:
Hi,

On 2022-02-22 18:06:24 -0800, Andres Freund wrote:
> On 2022-02-18 15:14:15 -0800, Andres Freund wrote:
> > I'm running out of ideas for how to try to reproduce this. I think we might
> > need some additional debugging information to get more information from the
> > buildfarm.
> 
> > I'm thinking of adding log_min_messages=DEBUG2 to primary3, passing --verbose
> > to pg_basebackup in $node_primary3->backup(...).
> >
> > It might also be worth adding DEBUG2 messages to ReplicationSlotShmemExit(),
> > ReplicationSlotCleanup(), InvalidateObsoleteReplicationSlots().
> 
> Planning to commit something like the attached.

This did provide us a bit more detail.

Seems to suggest something is holding a problematic lock in a way that I do not understand yet:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 019_replslot_limit.pl LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( RESERVE_WAL)
...
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to make
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 019_replslot_limit.pl DEBUG:  replication slot exit hook, without active slot
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin

last message from 1997084 until the immediate shutdown. Just checking for
temporary replication slots that need to be dropped requires
ReplicationSlotControlLock. Actually dropping also requires
ReplicationSlotAllocationLock.

1997084 does have a temporary replication slot to clean up.


2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:35] 019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to make
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:36] 019_replslot_limit.pl DEBUG:  replication slot exit hook, without active slot
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:37] 019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:38] 019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: done

1997083 succeeds in doing the cleanup. It does not have a temporary
replication slot. Making it look like somehow ReplicationSlotAllocationLock
hasn't been released.


2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:39] 019_replslot_limit.pl DEBUG:  shmem_exit(0): 7 on_shmem_exit callbacks to make
...
2022-02-23 09:09:53.076 EST [2022-02-23 09:09:52 EST 1997072:87] LOG:  received immediate shutdown request
...
2022-02-23 09:09:53.095 EST [2022-02-23 09:09:52 EST 1997072:90] DEBUG:  server process (PID 1997084) exited with exit code 2

It's *possible*, but not likely, that somehow 1997084 just doesn't get
scheduled for a prolonged amount of time.


We could be more certain if we shut down the cluster in fast rather than
immediate mode. So I'm thinking of doing something like

# We've seen occasional cases where multiple walsender pids are active. An
# immediate shutdown may hide evidence of a locking bug. So if multiple
# walsenders are observed, shut down in fast mode, and collect some more
# information.
if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
{
    my ($stdout, $stderr);
    $node_primary3->psql('postgres',
                         "\\a\\t\nSELECT * FROM pg_stat_activity",
                         stdout => \$stdout, stderr => \$stderr);
    diag $stdout, $stderr;
    $node_primary3->stop('fast');
    $node_standby3->stop('fast');
    die "could not determine walsender pid, can't continue";
}

Does that make sense? Better ideas?

Greetings,

Andres Freund



Re: Race conditions in 019_replslot_limit.pl

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Seems to suggest something is holding a problematic lock in a way that I do not understand yet:

> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
> 2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 019_replslot_limit.pl LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( RESERVE_WAL)
> ...
> 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to make
> 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 019_replslot_limit.pl DEBUG:  replication slot exit hook, without active slot
> 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin

> last message from 1997084 until the immediate shutdown.

Hmm.  Maybe put a couple more debug messages into ReplicationSlotCleanup
and/or ReplicationSlotDropPtr?  It doesn't seem very clear where in that
sequence it's hanging up.

> We could be more certain if we shut down the cluster in fast rather than
> immediate mode. So I'm thinking of doing something like

Does that risk an indefinite hangup of the buildfarm run?

            regards, tom lane



Re: Race conditions in 019_replslot_limit.pl

From:
Andres Freund
Date:
Hi,

On 2022-02-25 15:07:01 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Seems to suggest something is holding a problematic lock in a way that I do not understand yet:
> 
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
> > 2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 019_replslot_limit.pl LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( RESERVE_WAL)
> > ...
> > 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to make
> > 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 019_replslot_limit.pl DEBUG:  replication slot exit hook, without active slot
> > 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin
> 
> > last message from 1997084 until the immediate shutdown.
> 
> Hmm.  Maybe put a couple more debug messages into ReplicationSlotCleanup
> and/or ReplicationSlotDropPtr?  It doesn't seem very clear where in that
> sequence it's hanging up.

Yea, was thinking that as well.


I'm also wondering whether it's worth adding an assert, or at least a WARNING,
about no lwlocks held to the tail end of ShutdownPostgres?  I don't want to
add an LWLockReleaseAll() yet, before I understand what's actually happening.


> > We could be more certain if we shut down the cluster in fast rather than
> > immediate mode. So I'm thinking of doing something like
> 
> Does that risk an indefinite hangup of the buildfarm run?

I think not. The pg_ctl stop -m fast should time out after PGCTLTIMEOUT,
$self->_update_pid(-1); should notice it's not dead. The END{} block should
then shut it down in immediate mode.

Greetings,

Andres Freund



Re: Race conditions in 019_replslot_limit.pl

From:
Andres Freund
Date:
Hi,

On 2022-02-25 12:15:58 -0800, Andres Freund wrote:
> On 2022-02-25 15:07:01 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > Hmm.  Maybe put a couple more debug messages into ReplicationSlotCleanup
> > and/or ReplicationSlotDropPtr?  It doesn't seem very clear where in that
> > sequence it's hanging up.
> 
> Yea, was thinking that as well.
> 
> 
> I'm also wondering whether it's worth adding an assert, or at least a WARNING,
> about no lwlocks held to the tail end of ShutdownPostgres?  I don't want to
> add an LWLockReleaseAll() yet, before I understand what's actually happening.

Did those things. Sure looks like there's some interaction with
checkpointer. There's a similar sequence in the two failures since the
additional debugging.

2022-02-26 06:35:52.453 CET [6219bc37.8717f:20] DEBUG:  replication slot drop: pg_basebackup_553343: removed on-disk
2022-02-26 06:35:52.453 CET [6219bc37.87168:17] DEBUG:  snapshot of 0+0 running transaction ids (lsn 0/700060 oldest xid 720 latest complete 719 next xid 720)
2022-02-26 06:35:52.453 CET [6219bc37.87168:18] DEBUG:  begin invalidating obsolete replication slots older than 0/700000
2022-02-26 06:35:52.453 CET [6219bc37.87168:19] DEBUG:  done invalidating obsolete replication slots
2022-02-26 06:35:52.453 CET [6219bc37.87168:20] DEBUG:  attempting to remove WAL segments older than log file 000000000000000000000006
2022-02-26 06:35:52.453 CET [6219bc37.87168:21] DEBUG:  removing write-ahead log file "000000010000000000000006"
2022-02-26 06:35:52.453 CET [6219bc37.87168:22] DEBUG:  SlruScanDirectory invoking callback on pg_subtrans/0000
2022-02-26 06:35:52.453 CET [6219bc37.87168:23] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.351 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
[...]
2022-02-26 06:35:52.970 CET [6219bc37.87139:71] LOG:  received fast shutdown request
2022-02-26 06:35:52.970 CET [6219bc37.87139:72] LOG:  aborting any active transactions
[...]
2022-02-26 06:35:52.971 CET [6219bc37.87168:24] LOG:  shutting down
2022-02-26 06:35:52.976 CET [6219bc37.8717f:21] DEBUG:  replication slot drop: pg_basebackup_553343: done
2022-02-26 06:35:52.976 CET [6219bc37.8717f:22] DEBUG:  temporary replication slot cleanup: begin
2022-02-26 06:35:52.976 CET [6219bc37.8717f:23] DEBUG:  temporary replication slot cleanup: 0 in use, active_pid:
553385
2022-02-26 06:35:52.976 CET [6219bc37.8717f:24] DEBUG:  temporary replication slot cleanup: done
2022-02-26 06:35:52.977 CET [6219bc37.8717f:25] DEBUG:  shmem_exit(0): 7 on_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:26] DEBUG:  proc_exit(0): 3 callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:27] LOG:  disconnection: session time: 0:00:00.996 user=bf database= host=[local]
2022-02-26 06:35:52.977 CET [6219bc37.8717f:28] DEBUG:  exit(0)
2022-02-26 06:35:52.977 CET [6219bc37.8717f:29] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:30] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:31] DEBUG:  proc_exit(-1): 0 callbacks to make

So the backend (8717f/553343) is in the middle of ReplicationSlotDrop(), after
removing on-disk data. And then for 500ms nothing happens until checkpointer
wakes up again. As soon as it does, the slot drop continues.

Just before calling ReplicationSlotDrop() we were able to acquire
ReplicationSlotControlLock in share mode. Just after the log message after
which there's the delay, ReplicationSlotControlLock is locked in exclusive
mode.

Too tired now..

Greetings,

Andres Freund



Re: Race conditions in 019_replslot_limit.pl

From:
Tom Lane
Date:
I just noticed something very interesting: in a couple of recent
buildfarm runs with this failure, the pg_stat_activity printout
no longer shows the extra walsender:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-03-24%2017%3A50%3A10
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-03-23%2011%3A00%3A05

This is just two of the 33 such failures in the past ten days,
so maybe it's not surprising that we didn't see it already.
(I got bored before looking back further than that.)

What this suggests to me is that maybe the extra walsender is
indeed not blocked on anything, but is just taking its time
about exiting.  In these two runs, as well as in all the
non-failing runs, it had enough time to do so.

I suggest that we add a couple-of-seconds sleep in front of
the query that collects walsender PIDs, and maybe a couple more
seconds before the pg_stat_activity probe in the failure path,
and see if the behavior changes at all.  That should be enough
to confirm or disprove this idea pretty quickly.  If it is
right, a permanent fix could be to wait for the basebackup's
walsender to disappear from node_primary3's pg_stat_activity
before we start the one for node_standby3.
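
Roughly like this, perhaps (a sketch only, reusing the poll_query_until idiom
the recovery tests already use):

    # Wait for the basebackup's walsender to be gone before starting the
    # standby, so the later pid probe can only ever see one walsender.
    $node_primary3->poll_query_until('postgres',
        "SELECT count(*) = 0 FROM pg_stat_activity WHERE backend_type = 'walsender'",
        "t")
      or die "timed out waiting for the basebackup walsender to exit";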

            regards, tom lane



Re: Race conditions in 019_replslot_limit.pl

From:
Andres Freund
Date:
Hi,

On 2022-03-24 15:05:40 -0400, Tom Lane wrote:
> I just noticed something very interesting: in a couple of recent
> buildfarm runs with this failure, the pg_stat_activity printout
> no longer shows the extra walsender:

Oh. That is interesting. Thanks for catching that.


> What this suggests to me is that maybe the extra walsender is
> indeed not blocked on anything, but is just taking its time
> about exiting.  In these two runs, as well as in all the
> non-failing runs, it had enough time to do so.

Still odd that it started with the bugfix I committed. And the locking pattern
in [1] still seems suspicious. But the above does seem to suggest those might
just have been red herrings.


I had previously tried to make process exit slow by inserting delays at
various points, and only succeeded in making tests fail due to too many
connections. For some reason I didn't increase max_connections...

I now increased the test's max_connections and managed to get a kind of
similar failure by sticking
                elog(LOG, "received disconnect");
                pg_usleep(220000);
into the 'X' case in PostgresMain().

However, I always see three pids or the expected one pid, not two like the BF
cases. But it's not too surprising that such an artificial, uniform, slowdown
would have different symptoms than reality.


> I suggest that we add a couple-of-seconds sleep in front of
> the query that collects walsender PIDs, and maybe a couple more
> seconds before the pg_stat_activity probe in the failure path,
> and see if the behavior changes at all.

Makes sense. I was previously thinking that it'd make sense to sleep for a bit
before the ->stop('fast'); calls, so that we can see whether it's just the
shutdown unblocking things (which [1] seems to suggest).

How about the attached variation, which retries (for 15s, with 100ms sleeps)
if there are multiple walsenders, printing the debugging info each time? It'll
still fail the test if later iterations find just one walsender, which seems
the right behaviour for now.
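
For reference, the retry is roughly of this shape (a sketch with made-up names,
run inside the existing TAP test, not the exact attached patch):

    use Time::HiRes qw(usleep);

    my $senderpid;
    for my $i (1 .. 150)    # up to 15s, in 100ms steps
    {
        $senderpid = $node_primary3->safe_psql('postgres',
            "SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'");
        last if $senderpid =~ /^[0-9]+$/;
        diag "multiple walsenders active in iteration $i";
        usleep(100_000);
    }
    like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid");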


> That should be enough to confirm or disprove this idea pretty quickly.  If
> it is right, a permanent fix could be to wait for the basebackup's walsender
> to disappear from node_primary3's pg_stat_activity before we start the one
> for node_standby3.

For some tests a "confirmed disconnect" mode would be useful. Basically the
client waiting until it receives EOF after sending 'X'.

Greetings,

Andres Freund

[1] https://postgr.es/m/20220226072325.wtvkwvvga2wc3nkn%40alap3.anarazel.de

Attachments

Re: Race conditions in 019_replslot_limit.pl

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> How about the attached variation, which retries (for 15s, with 100ms sleeps)
> if there are multiple walsenders, printing the debugging info each time? It'll
> still fail the test if later iterations find just one walsender, which seems
> the right behaviour for now.

OK by me.

            regards, tom lane



Re: Race conditions in 019_replslot_limit.pl

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> How about the attached variation, which retries (for 15s, with 100ms sleeps)
> if there are multiple walsenders, printing the debugging info each time? It'll
> still fail the test if later iterations find just one walsender, which seems
> the right behaviour for now.

We have now four instances of failures with this version of the test,
and in every case the second iteration succeeded.  Is that enough
evidence yet?

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-03-27%2017%3A04%3A18
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2009%3A00%3A09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-03-25%2008%3A02%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2003%3A00%3A18

I'd like to silence this noise so that we can start tracking
lower-probability failure modes, like say these:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2022-03-26%2002%3A59%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-03-26%2015%3A53%3A51

            regards, tom lane



Re: Race conditions in 019_replslot_limit.pl

From:
Andres Freund
Date:
Hi,

On 2022-03-27 15:28:05 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > How about the attached variation, which retries (for 15s, with 100ms sleeps)
> > if there are multiple walsenders, printing the debugging info each time? It'll
> > still fail the test if later iterations find just one walsender, which seems
> > the right behaviour for now.
> 
> We have now four instances of failures with this version of the test,
> and in every case the second iteration succeeded.  Is that enough
> evidence yet?

I still feel like there's something off here. But that's probably not enough
to keep causing failures. I'm inclined to leave the debugging in for a bit
longer, but not fail the test anymore?


The way the temporary slot removal hangs for a while seems just off:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-03-27%2017%3A04%3A18
  2022-03-27 13:30:56.993 EDT [2022-03-27 13:30:56 EDT 750695:20] 019_replslot_limit.pl DEBUG:  replication slot drop: pg_basebackup_750695: removed on-disk
  ...
  2022-03-27 13:30:57.456 EDT [2022-03-27 13:30:57 EDT 750759:3] [unknown] LOG:  connection authorized: user=andrew database=postgres application_name=019_replslot_limit.pl
  2022-03-27 13:30:57.466 EDT [2022-03-27 13:30:57 EDT 750759:4] 019_replslot_limit.pl LOG:  statement: SELECT * FROM pg_stat_activity
  ...
  2022-03-27 13:30:57.474 EDT [2022-03-27 13:30:56 EDT 750679:87] DEBUG:  server process (PID 750759) exited with exit code 0
  2022-03-27 13:30:57.507 EDT [2022-03-27 13:30:56 EDT 750695:21] 019_replslot_limit.pl DEBUG:  replication slot drop: pg_basebackup_750695: done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2009%3A00%3A09
  2022-03-25 10:13:30.364 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop: pg_basebackup_4022819: begin
  2022-03-25 10:13:30.364 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop: pg_basebackup_4022819: removed on-disk
  ...
  2022-03-25 10:13:31.038 CET [4022841][client backend][5/7:0][[unknown]] LOG:  connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
  2022-03-25 10:13:31.039 CET [4022841][client backend][5/8:0][019_replslot_limit.pl] LOG:  statement: SELECT * FROM pg_stat_activity
  ...
  2022-03-25 10:13:31.045 CET [4022807][postmaster][:0][] DEBUG:  server process (PID 4022841) exited with exit code 0
  2022-03-25 10:13:31.056 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop: pg_basebackup_4022819: done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-03-25%2008%3A02%3A05
  2022-03-25 09:15:20.558 CET [3730425][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop: pg_basebackup_3730425: removed on-disk
  ...
  2022-03-25 09:15:20.803 CET [3730461][client backend][5/7:0][[unknown]] LOG:  connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
  2022-03-25 09:15:20.804 CET [3730461][client backend][5/8:0][019_replslot_limit.pl] LOG:  statement: SELECT * FROM pg_stat_activity
  ...
  2022-03-25 09:15:20.834 CET [3730381][postmaster][:0][] DEBUG:  server process (PID 3730461) exited with exit code 0
  2022-03-25 09:15:20.861 CET [3730425][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop: pg_basebackup_3730425: done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2003%3A00%3A18
  2022-03-25 04:14:03.025 CET [2674398][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop: pg_basebackup_2674398: removed on-disk
  ...
  2022-03-25 04:14:03.264 CET [2674463][client backend][5/7:0][[unknown]] LOG:  connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
  2022-03-25 04:14:03.265 CET [2674463][client backend][5/8:0][019_replslot_limit.pl] LOG:  statement: SELECT * FROM pg_stat_activity
  ...
  2022-03-25 04:14:03.270 CET [2674384][postmaster][:0][] DEBUG:  server process (PID 2674463) exited with exit code 0
  ...
  2022-03-25 04:14:03.324 CET [2674398][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop: pg_basebackup_2674398: done

We are able to start / finish several new connections between the two debug
elog()s in ReplicationSlotDropPtr()?


I wonder if there's something odd going on with ConditionVariableBroadcast().

Might be useful to add another debug message before/after
ConditionVariableBroadcast() and rmtree(). If the delay is due to rmtree()
being slow under concurrent tests I'd feel less concerned (although that
machine has dual NVMe drives...).


I really wish I could reproduce the failure. I went through a few hundred
cycles of that test in a separate checkout on that machine.


> I'd like to silence this noise so that we can start tracking
> lower-probability failure modes, like say these:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2022-03-26%2002%3A59%3A03

That one was a missing compiler flag. I see that failure reproducibly locally
when using asan with clang, unless I use -fsanitize-address-use-after-return=never. gcc has a different default
for the option, which is why I hadn't configured it.

If I understand correctly, the problem is that
-fsanitize-address-use-after-return uses an alternative stack.  Sometimes our
stack depth functions get called with the "proper" stack, and sometimes with
a "shadow" stack. Which breaks our stack depth checking.


> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-03-26%2015%3A53%3A51

This one I have no idea about yet. I assume it's just a race in a new test...

Greetings,

Andres Freund



Re: Race conditions in 019_replslot_limit.pl

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I still feel like there's something off here. But that's probably not enough
> to keep causing failures. I'm inclined to leave the debugging in for a bit
> longer, but not fail the test anymore?

WFM.

> The way the temporary slot removal hangs for a while seems just off:

Perhaps, but right now we're causing noise in the buildfarm and
learning nothing.

            regards, tom lane



Re: Race conditions in 019_replslot_limit.pl

From:
Andres Freund
Date:
Hi,

On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I still feel like there's something off here. But that's probably not enough
> > to keep causing failures. I'm inclined to leave the debugging in for a bit
> > longer, but not fail the test anymore?
> 
> WFM.

I've done so now.

Greetings,

Andres Freund



Re: Race conditions in 019_replslot_limit.pl

From:
Andres Freund
Date:
Hi,

On 2022-03-27 22:37:34 -0700, Andres Freund wrote:
> On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > I still feel like there's something off here. But that's probably not enough
> > > to keep causing failures. I'm inclined to leave the debugging in for a bit
> > > longer, but not fail the test anymore?
> > 
> > WFM.
> 
> I've done so now.

I did look over the test results a couple times since then and once more
today. There were a few cases with pretty significant numbers of iterations:

The highest is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2022-04-07%2022%3A14%3A03
showing:
# multiple walsenders active in iteration 19

It's somewhat interesting that the worst case was just around the feature
freeze, where the load on my buildfarm animal boxes was higher than normal.


In comparison to earlier approaches, with the current in-tree approach, we
don't do anything when hitting the "problem", other than wait. Which does give
us additional information - afaics there's nothing at all indicating that some
other backend existed allowing the replication slot drop to finish.

It just looks like for reasons I still do not understand, removing a directory
and 2 files or so takes multiple seconds (at least ~36 new connections, 18
pg_usleep(100_100)), while there are no other indications of problems.

I also still don't have a theory why this suddenly started to happen.


Unless somebody has another idea, I'm planning to remove all the debugging
code added, but keep the retry based approach in 019_replslot_limit.pl, so we
don't again get all the spurious failures.

Greetings,

Andres Freund



Re: Race conditions in 019_replslot_limit.pl

From:
Kyotaro Horiguchi
Date:
At Mon, 30 May 2022 12:01:55 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2022-03-27 22:37:34 -0700, Andres Freund wrote:
> > On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> > > Andres Freund <andres@anarazel.de> writes:
> > > > I still feel like there's something off here. But that's probably not enough
> > > > to keep causing failures. I'm inclined to leave the debugging in for a bit
> > > > longer, but not fail the test anymore?
> > > 
> > > WFM.
> > 
> > I've done so now.
> 
> I did look over the test results a couple times since then and once more
> today. There were a few cases with pretty significant numbers of iterations:
> 
> The highest is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2022-04-07%2022%3A14%3A03
> showing:
> # multiple walsenders active in iteration 19
> 
> It's somewhat interesting that the worst case was just around the feature
> freeze, where the load on my buildfarm animal boxes was higher than normal.

If the disk is too busy, CheckPointReplicationSlots may take a very long time.

> In comparison to earlier approaches, with the current in-tree approach, we
> don't do anything when hitting the "problem", other than wait. Which does give
> us additional information - afaics there's nothing at all indicating that some
> other backend existed allowing the replication slot drop to finish.

preventing?  Checkpointer and a client backend that ran "SELECT * FROM
pg_stat_activity" are the only processes running during the
blocking state.

> It just looks like for reasons I still do not understand, removing a directory
> and 2 files or so takes multiple seconds (at least ~36 new connections, 18
> pg_usleep(100_100)), while there are no other indications of problems.

That fact supports the idea that CheckPointReplicationSlots took a long time.

> I also still don't have a theory why this suddenly started to happen.

Maybe we need to see the OS-wide disk load at that time. Couldn't the
compiler or other non-postgres tools have put a significant load on the disks?

> Unless somebody has another idea, I'm planning to remove all the debugging
> code added, but keep the retry based approach in 019_replslot_limit.pl, so we
> don't again get all the spurious failures.

+1.  

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center