failure in 019_replslot_limit

From	Andres Freund
Subject	failure in 019_replslot_limit
Date
Msg-id	20230405184853.hjlxbrzjaxrcfjpl@awork3.anarazel.de
Responses	Re: failure in 019_replslot_limit  (Andres Freund <andres@anarazel.de>)
	Re: failure in 019_replslot_limit  (Alexander Lakhin <exclusion@gmail.com>)
List	pgsql-hackers
Hi,

I just saw the following failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-04-05%2017%3A47%3A03
after a commit of mine. The symptoms look unrelated though.

[17:54:42.188](258.345s) # poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# lost
# last actual query output:
# unreserved
# with stderr:
timed out waiting for slot to be lost at /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 400.
 

We're expecting "lost" but are getting "unreserved".
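The test's wait is essentially a poll-with-timeout around that query. A minimal sketch of such a loop (Python, purely illustrative; the real test uses the Perl poll_query_until helper from the TAP framework, and `query_fn` here stands in for running the SQL):

```python
import time

def poll_until(query_fn, expected, timeout=180.0, interval=0.1):
    """Repeatedly evaluate query_fn() until it returns `expected`
    or `timeout` seconds elapse.  Returns True on success."""
    deadline = time.monotonic() + timeout
    last = None
    while time.monotonic() < deadline:
        last = query_fn()
        if last == expected:
            return True
        time.sleep(interval)
    # On timeout, report the last observed value, as the TAP test does.
    print(f"timed out; expecting {expected!r}, last output {last!r}")
    return False
```

In the failure above, the query kept returning "unreserved" until the loop exhausted its timeout, because the transition to "lost" never happened.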


At first I thought this was just a race - it's not guaranteed that a checkpoint
to remove the WAL files occurs anytime soon.

But there might be something else going on - in this case a checkpoint
started, but never finished:

2023-04-05 17:50:23.786 UTC [345177] 019_replslot_limit.pl LOG:  statement: SELECT pg_switch_wal();
2023-04-05 17:50:23.787 UTC [342404] LOG:  checkpoints are occurring too frequently (2 seconds apart)
2023-04-05 17:50:23.787 UTC [342404] HINT:  Consider increasing the configuration parameter "max_wal_size".
2023-04-05 17:50:23.787 UTC [342404] LOG:  checkpoint starting: wal
2023-04-05 17:50:23.837 UTC [345264] 019_replslot_limit.pl LOG:  statement: CREATE TABLE t ();
2023-04-05 17:50:23.839 UTC [345264] 019_replslot_limit.pl LOG:  statement: DROP TABLE t;
2023-04-05 17:50:23.840 UTC [345264] 019_replslot_limit.pl LOG:  statement: SELECT pg_switch_wal();
2023-04-05 17:50:23.841 UTC [342404] LOG:  terminating process 344783 to release replication slot "rep3"
2023-04-05 17:50:23.841 UTC [342404] DETAIL:  The slot's restart_lsn 0/7000D8 exceeds the limit by 1048360 bytes.
2023-04-05 17:50:23.841 UTC [342404] HINT:  You might need to increase max_slot_wal_keep_size.
2023-04-05 17:50:23.862 UTC [344783] standby_3 FATAL:  terminating connection due to administrator command
2023-04-05 17:50:23.862 UTC [344783] standby_3 STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2023-04-05 17:50:23.893 UTC [345314] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
 
[many repetitions of the above, just differing in time and pid]
2023-04-05 17:54:42.084 UTC [491062] 019_replslot_limit.pl LOG:  statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
 
2023-04-05 17:54:42.200 UTC [342365] LOG:  received immediate shutdown request
2023-04-05 17:54:42.229 UTC [342365] LOG:  database system is shut down

Note that a checkpoint started at "17:50:23.787", but didn't finish before the
database was shut down.  As far as I can tell, this cannot be caused by
checkpoint_timeout, because by the time we get to invalidating replication
slots, we have already done CheckPointBuffers(), and that's the only thing that
delays based on checkpoint_timeout.

ISTM that this indicates that the checkpointer got stuck after signalling
344783.
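The hang is consistent with a terminate-and-wait shape: the checkpointer signals the process pinning the slot and then blocks until the slot is actually released. A rough sketch of that shape (Python with hypothetical names; not the actual slot-invalidation code, where the wait is on a condition variable):

```python
import threading

def invalidate_slot(slot, terminate_fn, timeout=None):
    """Signal the process pinning `slot`, then wait for it to release
    the slot.  If the signalled process never releases it (e.g. it is
    itself stuck), this wait blocks forever when timeout is None -
    which would leave the checkpoint unfinished, as seen in the log."""
    terminate_fn(slot["active_pid"])   # "terminating process ... to release replication slot"
    released = slot["released"]        # threading.Event standing in for the condition variable
    return released.wait(timeout)      # the checkpointer would block here
```

With this shape, a walsender that acknowledges the signal but never reaches the release path explains a checkpoint that starts, logs the termination, and then never completes.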

Do you see any other explanation?

Greetings,

Andres Freund


