Re: BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached

Поиск
Список
Период
Сортировка
От Alex Enachioaie
Тема Re: BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached
Дата
Msg-id Z4PW3R.84TE92O82TFQ@altmetric.com
обсуждение исходный текст
Ответ на Re: BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached  (Masahiko Sawada <sawada.mshk@gmail.com>)
Список pgsql-bugs
Hello Masahiko,

I believe that's inaccurate as the way we actually tracked down the root cause in our situation was that we tried using a persistent slot to
check if that would make any difference to the process and although that failed as well (for the same reason) we did get an additional 
error logged in that case. This specifically stated that the max_slot_wal_keep_size limit had been reached. So it seems like the behavior
does differ between the two unless I'm missing something.

Please see relevant server-side logs from running the exact same pg_basebackup command with a persistent replication slot below:

2021-12-03 00:57:16 UTC [29724-2645] LOG:  terminating process 3899 to release replication slot "backup"
2021-12-03 00:57:16 UTC [3899-1] replicator@[unknown] FATAL:  terminating connection due to administrator command
2021-12-03 00:57:16 UTC [3899-2] replicator@[unknown] STATEMENT:  START_REPLICATION SLOT "backup" 47189/75000000 TIMELINE 3
2021-12-03 00:57:16 UTC [29724-2646] LOG:  invalidating slot "backup" because its restart_lsn 47198/1E000000 exceeds max_slot_wal_keep_size

So, essentially the server side log emmitted on a temporary replication breaching max_slot_wal_keep_size limit is only:

2021-12-03 16:21:54 UTC [29724-2647] LOG:  terminating process 42601 to release replication slot "pg_basebackup_42601"

whereas for a persistent replication slot we get an additional line that clearly states _why_ the replication process was terminated:

2021-12-03 00:57:16 UTC [29724-2645] LOG: terminating process 3899 to release replication slot "backup"
2021-12-03 00:57:16 UTC [29724-2646] LOG: invalidating slot "backup" because its restart_lsn 47198/1E000000 exceeds max_slot_wal_keep_size

I'm not sure if this means that in the case of a temporary slot it does not get invalidated at all (I've not looked at the code), or it's simply that we don't emit a log message when it does because the slot would be discarded anyway, but such a message would be very useful for diagnostic purposes imo.

Thank you for looking at this

Kind regards,

Alex Enachioaie
Senior Site Reliability Engineer
Altmetric

On Fri, Dec 10 2021 at 17:41:56 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Dec 8, 2021 at 9:58 PM PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website: Bug reference: 17327 Logged by: Alex E Email address: alex@altmetric.com PostgreSQL version: 13.5 Operating system: Ubuntu 18.04 Description: We have recently run into a situation where our pg_basebackup-based backups started failing unexpectedly. These use WAL streaming to keep up with changes (which uses a temporary replication slot server side). The only errors logged on the client side were as listed below: pg_basebackup: error: could not receive data from WAL stream: SSL connection has been closed unexpectedly pg_basebackup: error: could not read COPY data: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. pg_basebackup: removing contents of data directory "/backups/some/path/" whilst on the server side we only got: 2021-12-03 16:21:54 UTC [29724-2647] LOG: terminating process 42601 to release replication slot "pg_basebackup_42601" 2021-12-03 16:21:54 UTC [42601-1] replicator@[unknown] FATAL: terminating connection due to administrator command 2021-12-03 16:21:54 UTC [42601-2] replicator@[unknown] STATEMENT: START_REPLICATION SLOT "pg_basebackup_42601" 4721F/45000000 TIMELINE 3 The above was very unhelpful as it made us believe we might be dealing with either a network interruption or another type of mysterious hardware error. We then proceeded to try several things to try and determine the root cause of the problem and eventually realized (by trial and error and monitoring various statistics) that we were breaching our max_slot_wal_keep_size limit for the temporary replication slot whilst taking the pg_basebackup. The only way we realized this was by using a permanent physical replication slot to take the backup instead of a temporary one, and when doing this a relevant error related to max_slot_wal_keep_size being breached was issued. The core issue here then in our opinion is that Postgres server should log an error when the max_slot_wal_keep_size limit is reached for temporary replication slots as well as for permanent ones as otherwise users/administrators are presented only with non-descript connection termination errors which do not point to the actual cause of the problem.
IIUC the LOG message you got, "terminating process 42601 to release replication slot "pg_basebackup_42601"", is the message you want. That is, it's emitted when the replication slot reached max_slot_wal_keep_size and was invalidated. I don't think such behavior varies depending on persistent slots and temporary slots. If the message "terminating process XXXX to release replication slot "YYY"" is not clear to indicate that the replication slot is invalidated, probably we can improve the message or add details. Regards,
--
Masahiko Sawada EDB: https://www.enterprisedb.com/

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Dmitry Dolgov
Дата:
Сообщение: Re: BUG #17326: Postgres crashed when pg_reload_conf() with ssl certificate parameters
Следующее
От: Strahinja Kustudić
Дата:
Сообщение: Re: BUG #17330: EXPLAIN hangs and very long query plans