Hi,
So, we have another bit of interesting information. Maybe related, maybe
not.
We noticed a weird situation on two clusters we're trying to upgrade.
In both cases the situation looked the same:
1. there was another process (debezium) connected to the source (pg12)
using logical replication
2. pg12 -> pg14 replication failed with the message 'ERROR: requested
WAL segment ... has already been removed'
3. some time afterwards (most likely a couple of hours) the pg process
that is/was responsible for debezium replication stopped handling WAL,
and instead is eating 100% of cpu.
When this happens, we can't cancel the "broken" WAL sender with
pg_cancel_backend(pid), and pg_terminate_backend() doesn't work on it
either!
strace of the process doesn't show anything.
When I tried to get a backtrace from gdb, all I got was:
(gdb) bt
#0 0x0000aaaad270521c in hash_seq_search ()
#1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4 0x0000aaaad257764c in ReorderBufferCommit ()
#5 0x0000aaaad256c804 in ?? ()
#6 0x0000aaaaf303d280 in ?? ()
If I quit gdb, restart it, and redo bt, I get either
#0 0x0000ffff806c81a8 in hash_seq_search@plt () from /usr/lib/postgresql/12/lib/pgoutput.so
#1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2 0x0000aaaad291ae58 in ?? ()
or
#0 0x0000aaaad2705244 in hash_seq_search ()
#1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4 0x0000aaaad257764c in ReorderBufferCommit ()
#5 0x0000aaaad256c804 in ?? ()
#6 0x0000aaaaf303d280 in ?? ()
At this moment, the only thing that we can do is kill -9 the process (or
restart pg).
I don't know if it's relevant, but I have this case *right now*, and if
it's helpful I can provide more information before we have to kill
it.
Best regards,
depesz