Thread: failure in 019_replslot_limit


failure in 019_replslot_limit

From:
Andres Freund
Date:
Hi,

I just saw the following failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-04-05%2017%3A47%3A03
after a commit of mine. The symptoms look unrelated though.

[17:54:42.188](258.345s) # poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# lost
# last actual query output:
# unreserved
# with stderr:
timed out waiting for slot to be lost at /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 400.

We're expecting "lost" but are getting "unreserved".
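
(For context: poll_query_until in the TAP framework simply re-runs that query
until it returns the expected value or a timeout expires.  A rough stand-in
in C, using libpq with a placeholder conninfo rather than the test's actual
connection parameters, would be:)

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    /* placeholder conninfo; the test talks to its "primary3" node */
    PGconn     *conn = PQconnectdb("dbname=postgres");
    const char *query = "SELECT wal_status FROM pg_replication_slots "
                        "WHERE slot_name = 'rep3'";

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* retry for a while, then give up, like poll_query_until does */
    for (int i = 0; i < 180; i++)
    {
        PGresult   *res = PQexec(conn, query);

        if (PQresultStatus(res) == PGRES_TUPLES_OK &&
            PQntuples(res) == 1 &&
            strcmp(PQgetvalue(res, 0, 0), "lost") == 0)
        {
            PQclear(res);
            PQfinish(conn);
            return 0;           /* slot invalidated, as the test expects */
        }
        PQclear(res);
        sleep(1);
    }

    fprintf(stderr, "timed out waiting for slot to be lost\n");
    PQfinish(conn);
    return 1;
}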


At first I thought this was just a race - it's not guaranteed that a checkpoint
to remove the WAL files occurs anytime soon.

But there might be something else going on - in this case a checkpoint
started, but never finished:

2023-04-05 17:50:23.786 UTC [345177] 019_replslot_limit.pl LOG:  statement: SELECT pg_switch_wal();
2023-04-05 17:50:23.787 UTC [342404] LOG:  checkpoints are occurring too frequently (2 seconds apart)
2023-04-05 17:50:23.787 UTC [342404] HINT:  Consider increasing the configuration parameter "max_wal_size".
2023-04-05 17:50:23.787 UTC [342404] LOG:  checkpoint starting: wal
2023-04-05 17:50:23.837 UTC [345264] 019_replslot_limit.pl LOG:  statement: CREATE TABLE t ();
2023-04-05 17:50:23.839 UTC [345264] 019_replslot_limit.pl LOG:  statement: DROP TABLE t;
2023-04-05 17:50:23.840 UTC [345264] 019_replslot_limit.pl LOG:  statement: SELECT pg_switch_wal();
2023-04-05 17:50:23.841 UTC [342404] LOG:  terminating process 344783 to release replication slot "rep3"
2023-04-05 17:50:23.841 UTC [342404] DETAIL:  The slot's restart_lsn 0/7000D8 exceeds the limit by 1048360 bytes.
2023-04-05 17:50:23.841 UTC [342404] HINT:  You might need to increase max_slot_wal_keep_size.
2023-04-05 17:50:23.862 UTC [344783] standby_3 FATAL:  terminating connection due to administrator command
2023-04-05 17:50:23.862 UTC [344783] standby_3 STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2023-04-05 17:50:23.893 UTC [345314] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
[many repetitions of the above, just differing in time and pid]
2023-04-05 17:54:42.084 UTC [491062] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
2023-04-05 17:54:42.200 UTC [342365] LOG:  received immediate shutdown request
2023-04-05 17:54:42.229 UTC [342365] LOG:  database system is shut down

Note that a checkpoint started at "17:50:23.787", but didn't finish before the
database was shut down.  As far as I can tell, this can not be caused by
checkpoint_timeout, because by the time we get to invalidating replication
slots, we already did CheckPointBuffers(), and that's the only thing that
delays based on checkpoint_timeout.

ISTM that this indicates that checkpointer got stuck after signalling
344783.

Do you see any other explanation?

Greetings,

Andres Freund



Re: failure in 019_replslot_limit

From:
Andres Freund
Date:
Hi,

On 2023-04-05 11:48:53 -0700, Andres Freund wrote:
> Note that a checkpoint started at "17:50:23.787", but didn't finish before the
> database was shut down.  As far as I can tell, this can not be caused by
> checkpoint_timeout, because by the time we get to invalidating replication
> slots, we already did CheckPointBuffers(), and that's the only thing that
> delays based on checkpoint_timeout.
> 
> ISTM that this indicates that checkpointer got stuck after signalling
> 344783.
> 
> Do you see any other explanation?

This all sounded vaguely familiar. After a bit of digging I found this:

https://postgr.es/m/20220223014855.4lsddr464i7mymk2%40alap3.anarazel.de

Which seems like it plausibly explains the failed test?

Greetings,

Andres Freund



Re: failure in 019_replslot_limit

From:
Kyotaro Horiguchi
Date:
At Wed, 5 Apr 2023 11:55:14 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2023-04-05 11:48:53 -0700, Andres Freund wrote:
> > Note that a checkpoint started at "17:50:23.787", but didn't finish before the
> > database was shut down.  As far as I can tell, this can not be caused by
> > checkpoint_timeout, because by the time we get to invalidating replication
> > slots, we already did CheckPointBuffers(), and that's the only thing that
> > delays based on checkpoint_timeout.
> > 
> > ISTM that this indicates that checkpointer got stuck after signalling
> > 344783.
> > 
> > Do you see any other explanation?
> 
> This all sounded vaguely familiar. After a bit of digging I found this:
> 
> https://postgr.es/m/20220223014855.4lsddr464i7mymk2%40alap3.anarazel.de
> 
> Which seems like it plausibly explains the failed test?

My understanding is that ConditionVariableSleep() can experience spurious
wake-ups, and ReplicationSlotControlLock doesn't prevent slot release, so I
can imagine a situation where that blocking might happen.  If the call to
ConditionVariableSleep(&s->active_cv) wakes up unexpectedly because the
latch was set for a reason other than the CV broadcast, and the target
process releases the slot between the fetch of active_pid in the loop and
the following call to ConditionVariablePrepareToSleep(), the CV broadcast
triggered by the slot release can be missed. If that's the case, we'll need
to check active_pid again after calling ConditionVariablePrepareToSleep(),
roughly as in the sketch below.
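
Schematically (a hypothetical fragment only, using the ConditionVariable API
from storage/condition_variable.h and the ReplicationSlot fields from
slot.h, not the actual slot.c code), the wait loop would then have the
following shape:

    ConditionVariablePrepareToSleep(&s->active_cv);
    for (;;)
    {
        pid_t       active_pid;

        /* re-check only after we are registered on the CV's waiter list */
        SpinLockAcquire(&s->mutex);
        active_pid = s->active_pid;
        SpinLockRelease(&s->mutex);

        if (active_pid == 0)
            break;              /* slot has been released */

        /* a spurious latch wake-up just loops back to the re-check */
        ConditionVariableSleep(&s->active_cv, 0 /* wait event omitted */);
    }
    ConditionVariableCancelSleep();

A broadcast fired after the registration is delivered through the latch, so
it can no longer be missed.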

Does this make sense?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: failure in 019_replslot_limit

From:
Alexander Lakhin
Date:
Hello Andres,

05.04.2023 21:48, Andres Freund wrote:
> I just saw the following failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-04-05%2017%3A47%3A03
> after a commit of mine. The symptoms look unrelated though.
>
> [17:54:42.188](258.345s) # poll_query_until timed out executing this query:
> # SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
> # expecting this output:
> # lost
> # last actual query output:
> # unreserved
> # with stderr:
> timed out waiting for slot to be lost at
> /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 400.
>
> We're expecting "lost" but are getting "unreserved".
>
> ...
>
> ISTM that this indicates that checkpointer got stuck after signalling
> 344783.
>
> Do you see any other explanation?
>

I've managed to reproduce this issue (which still persists:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-02-04%2001%3A53%3A44
) and saw that it's not the checkpointer but the walsender that is hanging:
[12:15:03.753](34.771s) ok 17 - have walsender pid 317885
[12:15:03.875](0.122s) ok 18 - have walreceiver pid 317884
[12:15:04.808](0.933s) ok 19 - walsender termination logged
...

Last essential messages in _primary3.log are:
2024-02-09 12:15:04.823 UTC [318036][not initialized][:0] LOG: connection received: host=[local]
2024-02-09 12:15:04.823 UTC [317885][walsender][3/0:0] FATAL: terminating connection due to administrator command
2024-02-09 12:15:04.823 UTC [317885][walsender][3/0:0] STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
(then the test just queries the slot state, there are no other messages
related to walsender)

And I see the walsender process still running (I've increased the timeout
to keep the test running and to connect to the process in question), with
the following stack trace:
#0  0x00007fe4feac3d16 in epoll_wait (epfd=5, events=0x55b279b70f38, maxevents=1, timeout=timeout@entry=-1) at 
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055b278b9ab32 in WaitEventSetWaitBlock (set=set@entry=0x55b279b70eb8, cur_timeout=cur_timeout@entry=-1, 
occurred_events=occurred_events@entry=0x7ffda5ffac90, nevents=nevents@entry=1) at latch.c:1571
#2  0x000055b278b9b6b6 in WaitEventSetWait (set=0x55b279b70eb8, timeout=timeout@entry=-1, 
occurred_events=occurred_events@entry=0x7ffda5ffac90, nevents=nevents@entry=1, 
wait_event_info=wait_event_info@entry=100663297) at latch.c:1517
#3  0x000055b278a3f11f in secure_write (port=0x55b279b65aa0, ptr=ptr@entry=0x55b279bfbd08, len=len@entry=21470) at 
be-secure.c:296
#4  0x000055b278a460dc in internal_flush () at pqcomm.c:1356
#5  0x000055b278a461d4 in internal_putbytes (s=s@entry=0x7ffda5ffad3c "E\177", len=len@entry=1) at pqcomm.c:1302
#6  0x000055b278a46299 in socket_putmessage (msgtype=<optimized out>, s=0x55b279b363c0 "SFATAL", len=112) at
pqcomm.c:1483
#7  0x000055b278a48670 in pq_endmessage (buf=buf@entry=0x7ffda5ffada0) at pqformat.c:302
#8  0x000055b278d0c82a in send_message_to_frontend (edata=edata@entry=0x55b27908e500 <errordata>) at elog.c:3590
#9  0x000055b278d0cfe2 in EmitErrorReport () at elog.c:1716
#10 0x000055b278d0d17d in errfinish (filename=filename@entry=0x55b278eaa480 "postgres.c", lineno=lineno@entry=3295, 
funcname=funcname@entry=0x55b278eaaef0 <__func__.16> "ProcessInterrupts") at elog.c:551
#11 0x000055b278bc41c9 in ProcessInterrupts () at postgres.c:3295
#12 0x000055b278b6c9af in WalSndLoop (send_data=send_data@entry=0x55b278b6c346 <XLogSendPhysical>) at walsender.c:2680
#13 0x000055b278b6cef1 in StartReplication (cmd=cmd@entry=0x55b279b733e0) at walsender.c:987
#14 0x000055b278b6d865 in exec_replication_command (cmd_string=cmd_string@entry=0x55b279b39c60 "START_REPLICATION SLOT \"rep3\" 0/700000 TIMELINE 1") at walsender.c:2039
#15 0x000055b278bc7d71 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4649
#16 0x000055b278b2329d in BackendRun (port=port@entry=0x55b279b65aa0) at postmaster.c:4464
#17 0x000055b278b263ae in BackendStartup (port=port@entry=0x55b279b65aa0) at postmaster.c:4140
#18 0x000055b278b26539 in ServerLoop () at postmaster.c:1776
#19 0x000055b278b27ac5 in PostmasterMain (argc=argc@entry=4, argv=argv@entry=0x55b279b34180) at postmaster.c:1475
#20 0x000055b278a49ab0 in main (argc=4, argv=0x55b279b34180) at main.c:198

(gdb) frame 9
(gdb) print edata->message
$3 = 0x55b279b367d0 "terminating connection due to administrator command"

So it looks like walsender tries to send the message to walreceiver, which
is paused, and thus walsender gets stuck on it.
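
The mechanism is easy to demonstrate outside PostgreSQL.  In the toy program
below (a standalone sketch, not backend code), one end of a socketpair plays
the stopped walreceiver and never reads; once the kernel buffers are full,
the sender cannot write anything more, and a wait for writability never
ends, which is what WaitEventSetWait() in frame #2 above is doing, just
without any timeout:

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    int         sv[2];
    char        buf[8192];
    long        total = 0;
    ssize_t     n;
    struct pollfd pfd;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 1;

    /* the sending side is non-blocking, like a backend socket */
    fcntl(sv[0], F_SETFL, fcntl(sv[0], F_GETFL) | O_NONBLOCK);

    /* sv[1] is never read from: the "stopped walreceiver" */
    memset(buf, 'x', sizeof(buf));
    while ((n = send(sv[0], buf, sizeof(buf), 0)) > 0)
        total += n;

    printf("send buffer full after %ld bytes: %s\n",
           total, strerror(errno));

    /* wait for the socket to become writable: 1s here, forever (-1)
     * in the walsender's WaitEventSetWait() */
    pfd.fd = sv[0];
    pfd.events = POLLOUT;
    printf("poll(POLLOUT) returned %d (0 = timeout)\n",
           poll(&pfd, 1, 1000));

    return 0;
}

Replace the 1-second timeout with -1, as the backend does, and poll() never
returns; that is the state the walsender above is stuck in.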

Best regards,
Alexander



Re: failure in 019_replslot_limit

From:
Andres Freund
Date:
Hi,

On 2024-02-09 18:00:01 +0300, Alexander Lakhin wrote:
> I've managed to reproduce this issue (which still persists:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-02-04%2001%3A53%3A44
> ) and saw that it's not the checkpointer but the walsender that is hanging:

How did you reproduce this?



> And I see the walsender process still running (I've increased the timeout
> to keep the test running and to connect to the process in question), with
> the following stack trace:
> #0  0x00007fe4feac3d16 in epoll_wait (epfd=5, events=0x55b279b70f38,
> maxevents=1, timeout=timeout@entry=-1) at
> ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> #1  0x000055b278b9ab32 in WaitEventSetWaitBlock
> (set=set@entry=0x55b279b70eb8, cur_timeout=cur_timeout@entry=-1,
> occurred_events=occurred_events@entry=0x7ffda5ffac90,
> nevents=nevents@entry=1) at latch.c:1571
> #2  0x000055b278b9b6b6 in WaitEventSetWait (set=0x55b279b70eb8,
> timeout=timeout@entry=-1,
> occurred_events=occurred_events@entry=0x7ffda5ffac90,
> nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663297) at
> latch.c:1517
> #3  0x000055b278a3f11f in secure_write (port=0x55b279b65aa0,
> ptr=ptr@entry=0x55b279bfbd08, len=len@entry=21470) at be-secure.c:296
> #4  0x000055b278a460dc in internal_flush () at pqcomm.c:1356
> #5  0x000055b278a461d4 in internal_putbytes (s=s@entry=0x7ffda5ffad3c "E\177", len=len@entry=1) at pqcomm.c:1302

So it's the issue that we wait effectively forever to send a FATAL. I've
previously proposed that we should not block sending out fatal errors, given
that this allows clients to prevent graceful restarts and a lot of other things.

Greetings,

Andres Freund



Re: failure in 019_replslot_limit

From:
Alexander Lakhin
Date:
09.02.2024 21:59, Andres Freund wrote:
>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-02-04%2001%3A53%3A44
>> ) and saw that it's not the checkpointer but the walsender that is hanging:
> How did you reproduce this?

As kestrel didn't produce this failure until recently, I supposed that the
cause is the same as with subscription/031_column_list (longer test
duration), so I ran this test in parallel (with 20-30 jobs) in a slowed-down
VM, so that the duration of one successful test run increased to 100-120
seconds. And I was lucky enough to catch it within 100 iterations. But now
that we know what's happening there, I think I could reproduce it much more
easily, with some sleep(s) added, if that would be of any interest.

> So it's the issue that we wait effectively forever to send a FATAL. I've
> previously proposed that we should not block sending out fatal errors, given
> that this allows clients to prevent graceful restarts and a lot of other things.
>

Yes, I had demonstrated one of those unpleasant things previously too:
https://www.postgresql.org/message-id/91c8860a-a866-71a7-a060-3f07af531295%40gmail.com

Best regards,
Alexander