Backend handling replication slot stuck using 100% cpu, unkillable

From     hubert depesz lubaczewski
Subject  Backend handling replication slot stuck using 100% cpu, unkillable
Date
Msg-id   ZKKywNsS9tR/3R80@depesz.com
Replies  Re: Backend handling replication slot stuck using 100% cpu, unkillable  (hubert depesz lubaczewski <depesz@depesz.com>)
List     pgsql-bugs
Hi,
we are using debezium to get change data from Pg.

This particular Pg is 12.9, and will soon be upgraded to 14.something
(this Thursday).
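For context, Debezium streams changes over a logical replication slot decoded
with pgoutput (as visible in the START_REPLICATION command below). A rough
sketch of how such a slot and publication get created - the names are
placeholders, and the connector normally creates the slot on its own:

#v+
-- publication the connector subscribes to (real name redacted as 'xxx' below)
CREATE PUBLICATION xxx FOR ALL TABLES;

-- logical slot using the pgoutput plugin
SELECT * FROM pg_create_logical_replication_slot('slot_name', 'pgoutput');
#v-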

But as of now, we have a weird case, one that we've seen before on other
clusters. Specifically - one of the backends seems to be stuck in an infinite
loop, using 100% of a single core.

pg_stat_activity looks like:

#v+
─[ RECORD 1 ]────┬──────────────────────────────────────────────────────────
datid            │ 16606
datname          │ dbname
pid              │ 14586
usesysid         │ 16584
usename          │ some_user
application_name │ Debezium Streaming
client_addr      │ 10.a.b.c
client_hostname  │ [null]
client_port      │ 57546
backend_start    │ 2023-07-02 18:33:00.302983+00
xact_start       │ [null]
query_start      │ 2023-07-02 18:33:00.784202+00
state_change     │ 2023-07-02 18:33:00.784215+00
wait_event_type  │ [null]
wait_event       │ [null]
state            │ active
backend_xid      │ [null]
backend_xmin     │ [null]
query            │ START_REPLICATION SLOT "slot_name" LOGICAL 1D65/EAF6F980 ("proto_version" '1', "publication_names"
'xxx')
backend_type     │ walsender
#v-
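
(That record comes from more or less this query, with expanded output
enabled - filtering on backend_type is just my shortcut for finding the
walsender:)

#v+
SELECT *
  FROM pg_stat_activity
 WHERE backend_type = 'walsender'
   AND state = 'active';
#v-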

strace doesn't show any output:

#v+
=$ sudo strace -f -p 14586
strace: Process 14586 attached
#v-
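
One way to check whether the walsender is making any progress on the
replication side - a sketch, using the pid from pg_stat_activity above:

#v+
-- sent_lsn should keep advancing if the walsender is actually sending anything
SELECT pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
  FROM pg_stat_replication
 WHERE pid = 14586;
#v-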

gdb shows:

#v+
$ sudo gdb -p 14586
...
Reading symbols from /usr/lib/postgresql/12/lib/auto_explain.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/postgresql/12/lib/pglogical.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/aarch64-linux-gnu/libpq.so.5...(no debugging symbols found)...done.
Reading symbols from /usr/lib/postgresql/12/lib/pg_prewarm.so...(no debugging symbols found)...done.
Reading symbols from /lib/aarch64-linux-gnu/libnss_files.so.2...(no debugging symbols found)...done.
Reading symbols from /usr/lib/postgresql/12/lib/pgoutput.so...(no debugging symbols found)...done.
0x0000aaaab6b5123c in hash_seq_search ()
(gdb) bt
#0  0x0000aaaab6b5123c in hash_seq_search ()
#1  0x0000ffffb97c46cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaab6b2f644 in CallSyscacheCallbacks ()
#3  0x0000aaaab6b2f644 in CallSyscacheCallbacks ()
#4  0x0000aaaab69c364c in ReorderBufferCommit ()
#5  0x0000aaaab69b8804 in ?? ()
#6  0x0000aaaaec8f4bf0 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)
#v-

After I quit gdb and restarted it about a minute later, I got this backtrace:

#v+
(gdb) bt
#0  0x0000aaaab6b51200 in hash_seq_search ()
#1  0x0000ffffb97c46cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaab6d66e58 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
#v-
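
The slot itself can also be inspected - a sketch, assuming the slot name from
the START_REPLICATION command; its LSNs would presumably stop advancing while
the decoder is stuck:

#v+
SELECT slot_name, plugin, active_pid, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots
 WHERE slot_name = 'slot_name';
#v-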

I can't cancel the query with pg_cancel_backend(), and it also doesn't respond to pg_terminate_backend().
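
For completeness, this is roughly what I ran (pid from pg_stat_activity above):

#v+
SELECT pg_cancel_backend(14586);    -- no effect
SELECT pg_terminate_backend(14586); -- backend keeps spinning
#v-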

I know I could kill -9 the process, but that would force all db connections to be reset, which is something I'd prefer to avoid.

Is there anything I could do?

Best regards,

depesz



