Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From	hubert depesz lubaczewski
Subject	Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date	10 February 2023, 14:31:24
Msg-id	Y+ZVPHHcYirQDgJF@depesz.com
In reply to	Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (Masahiko Sawada <sawada.mshk@gmail.com>)
List	pgsql-bugs

Hi,
we have another bit of interesting information. Maybe related, maybe
not.

We noticed a weird situation on two clusters we're trying to upgrade.

In both cases the situation looked the same:

1. there was another process (debezium) connected to the source (pg12)
   using logical replication
2. pg12 -> pg14 replication failed with the message 'ERROR:  requested
   WAL segment ... has already been removed'
3. some time afterwards (most likely a couple of hours) the pg process
   that is/was responsible for debezium replication stopped handling
   WAL, and instead is eating 100% of cpu.

When this happens, we can't cancel the "broken" wal sender with
pg_cancel_backend(pid), and pg_terminate_backend() doesn't work on it
either!

strace of the process doesn't show anything.

When I tried to get a backtrace from gdb, all I got was:

(gdb) bt
#0  0x0000aaaad270521c in hash_seq_search ()
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4  0x0000aaaad257764c in ReorderBufferCommit ()
#5  0x0000aaaad256c804 in ?? ()
#6  0x0000aaaaf303d280 in ?? ()

If I quit gdb, restart it, and redo bt, I get either

#0  0x0000ffff806c81a8 in hash_seq_search@plt () from /usr/lib/postgresql/12/lib/pgoutput.so
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad291ae58 in ?? ()

or

#0  0x0000aaaad2705244 in hash_seq_search ()
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4  0x0000aaaad257764c in ReorderBufferCommit ()
#5  0x0000aaaad256c804 in ?? ()
#6  0x0000aaaaf303d280 in ?? ()
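
All three samples above have hash_seq_search as the innermost frame,
called from pgoutput.so. If more samples are taken, that comparison can
be done mechanically. A minimal sketch in Python, assuming the gdb
output is pasted in as plain text; parse_frames, innermost_counts and
SAMPLES are names made up for this illustration and are not part of gdb
or PostgreSQL:

```python
import re
from collections import Counter

# Matches gdb "bt" frame lines such as:
#   #0  0x0000aaaad270521c in hash_seq_search ()
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (\S+) \(")


def parse_frames(bt_text):
    """Return the function names of one backtrace, innermost frame first."""
    frames = []
    for line in bt_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            # Drop suffixes such as "@plt" so PLT stubs count as the same function.
            frames.append(match.group(2).split("@")[0])
    return frames


def innermost_counts(samples):
    """Count which function sits at frame #0 across several backtraces."""
    counts = Counter()
    for sample in samples:
        frames = parse_frames(sample)
        if frames:
            counts[frames[0]] += 1
    return counts


# The three backtraces from this message, trimmed to the first frames.
SAMPLES = [
    """
    #0  0x0000aaaad270521c in hash_seq_search ()
    #1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
    #2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
    """,
    """
    #0  0x0000ffff806c81a8 in hash_seq_search@plt () from /usr/lib/postgresql/12/lib/pgoutput.so
    #1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
    #2  0x0000aaaad291ae58 in ?? ()
    """,
    """
    #0  0x0000aaaad2705244 in hash_seq_search ()
    #1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
    #2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
    """,
]

print(innermost_counts(SAMPLES).most_common(1))
```

For the three samples above this prints [('hash_seq_search', 3)]. Three
samples are not much, so this only says the process was in the same
place each time I looked, which fits with the 100% cpu usage.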

At this moment, the only thing that we can do is kill -9 the process (or
restart pg).

I don't know if it's relevant, but I have this case *right now*, and if
it's helpful I can provide more information before we have to kill
it.

Best regards,

depesz
