Thread: Backend handling replication slot stuck using 100% cpu, unkillable


Backend handling replication slot stuck using 100% cpu, unkillable

From: hubert depesz lubaczewski
Date:
Hi,
we are using debezium to get change data from Pg.

This particular Pg is 12.9, and will soon be upgraded to 14.something
(this Thursday).

But as of now, we have a weird case, one that we've seen before on other
clusters.
Specifically, one of the backends seems to be stuck in an infinite loop,
using 100% of a single core.

pg_stat_activity looks like:

#v+
─[ RECORD 1
]────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
datid            │ 16606
datname          │ dbname
pid              │ 14586
usesysid         │ 16584
usename          │ some_user
application_name │ Debezium Streaming
client_addr      │ 10.a.b.c
client_hostname  │ [null]
client_port      │ 57546
backend_start    │ 2023-07-02 18:33:00.302983+00
xact_start       │ [null]
query_start      │ 2023-07-02 18:33:00.784202+00
state_change     │ 2023-07-02 18:33:00.784215+00
wait_event_type  │ [null]
wait_event       │ [null]
state            │ active
backend_xid      │ [null]
backend_xmin     │ [null]
query            │ START_REPLICATION SLOT "slot_name" LOGICAL 1D65/EAF6F980 ("proto_version" '1', "publication_names"
'xxx')
backend_type     │ walsender
#v-

strace doesn't show any output:

#v+
=$ sudo strace -f -p 14586
strace: Process 14586 attached
#v-

gdb shows:

#v+
$ sudo gdb -p 14586
...
Reading symbols from /usr/lib/postgresql/12/lib/auto_explain.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/postgresql/12/lib/pglogical.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/aarch64-linux-gnu/libpq.so.5...(no debugging symbols found)...done.
Reading symbols from /usr/lib/postgresql/12/lib/pg_prewarm.so...(no debugging symbols found)...done.
Reading symbols from /lib/aarch64-linux-gnu/libnss_files.so.2...(no debugging symbols found)...done.
Reading symbols from /usr/lib/postgresql/12/lib/pgoutput.so...(no debugging symbols found)...done.
0x0000aaaab6b5123c in hash_seq_search ()
(gdb) bt
#0  0x0000aaaab6b5123c in hash_seq_search ()
#1  0x0000ffffb97c46cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaab6b2f644 in CallSyscacheCallbacks ()
#3  0x0000aaaab6b2f644 in CallSyscacheCallbacks ()
#4  0x0000aaaab69c364c in ReorderBufferCommit ()
#5  0x0000aaaab69b8804 in ?? ()
#6  0x0000aaaaec8f4bf0 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)
#v-

After I quit gdb and restarted it about a minute later, I got this backtrace:

#v+
(gdb) bt
#0  0x0000aaaab6b51200 in hash_seq_search ()
#1  0x0000ffffb97c46cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaab6d66e58 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
#v-

I can't cancel it - pg_cancel_backend() has no effect, and it doesn't respond to pg_terminate_backend() either.
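
For the record, this is more or less what I tried (pid taken from the
pg_stat_activity output above; just a sketch, not the exact session):

#v+
=$ psql -c "select pg_cancel_backend(14586)"     # sends SIGINT, normally cancels the running query
=$ psql -c "select pg_terminate_backend(14586)"  # sends SIGTERM, normally makes the backend exit
#v-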

I know I could kill -9 it, but that would force a crash-restart and cycle all DB connections, which is something I'd prefer to avoid.

Is there anything I could do?

Best regards,

depesz




Re: Backend handling replication slot stuck using 100% cpu, unkillable

From: hubert depesz lubaczewski
Date:
On Mon, Jul 03, 2023 at 01:36:32PM +0200, hubert depesz lubaczewski wrote:
> Hi,
> we are using debezium to get change data from Pg.
> 
> This particular pg is 12.9, and will be soon upgrade to 14.something
> (this thursday).

So, I installed dbgsym for this Pg, and this by accident upgraded Pg to
12.14.

Now I do have debug symbols, though, so the backtrace can be more
informative.

I ran:

for i in 1 2 3; do date; sudo gdb -batch -p 8938 -ex bt; sleep 30; echo; done

And got this:

#v+
Mon Jul  3 12:50:14 UTC 2023
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
hash_seq_search (status=status@entry=0xfffff0173f40) at ./build/../src/backend/utils/hash/dynahash.c:1439
1439    ./build/../src/backend/utils/hash/dynahash.c: No such file or directory.
#0  hash_seq_search (status=status@entry=0xfffff0173f40) at ./build/../src/backend/utils/hash/dynahash.c:1439
#1  0x0000ffffa1bc8714 in rel_sync_cache_publication_cb (arg=<optimized out>, cacheid=<optimized out>,
hashvalue=<optimized out>) at ./build/../src/backend/replication/pgoutput/pgoutput.c:665
#2  0x0000aaaab4bfcef4 in CallSyscacheCallbacks (cacheid=47, hashvalue=1542357812) at
./build/../src/backend/utils/cache/inval.c:1520
#3  0x0000aaaab4a91884 in ReorderBufferExecuteInvalidations (rb=0xffffa4ac5308 <malloc+160>, txn=0xfffff0174240,
txn=0xfffff0174240) at ./build/../src/backend/replication/logical/reorderbuffer.c:2187
#4  ReorderBufferCommit (rb=0xffffa4ac5308 <malloc+160>, xid=xid@entry=2741814901, commit_lsn=187650155969544,
end_lsn=<optimized out>, commit_time=commit_time@entry=741514150878208, origin_id=origin_id@entry=0,
origin_lsn=origin_lsn@entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1816
#5  0x0000aaaab4a869bc in DecodeCommit (xid=2741814901, parsed=0xfffff0174390, buf=0xfffff0174510, ctx=0xaaaad5e1df00)
at ./build/../src/backend/replication/logical/decode.c:654
#6  DecodeXactOp (ctx=ctx@entry=0xaaaad5e1df00, buf=0xfffff0174510, buf@entry=0xfffff0174570) at
./build/../src/backend/replication/logical/decode.c:249
#7  0x0000aaaab4a86ad4 in LogicalDecodingProcessRecord (ctx=0xaaaad5e1df00, record=0xaaaad5e1e198) at
./build/../src/backend/replication/logical/decode.c:117
#8  0x0000aaaab4a996ec in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2883
#9  0x0000aaaab4a9bbb0 in WalSndLoop (send_data=send_data@entry=0xaaaab4a99688 <XLogSendLogical>) at
./build/../src/backend/replication/walsender.c:2232
#10 0x0000aaaab4a9c674 in StartLogicalReplication (cmd=0xaaaad5e47f90) at
./build/../src/backend/replication/walsender.c:1134
#11 exec_replication_command (cmd_string=cmd_string@entry=0xaaaad5d1db00 "START_REPLICATION SLOT \"slot_name\" LOGICAL
1D6C/92965050 (\"proto_version\" '1', \"publication_names\" 'xxx')") at
./build/../src/backend/replication/walsender.c:1602
#12 0x0000aaaab4af0c08 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaad5d7aaf8, dbname=<optimized out>,
username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4289
#13 0x0000aaaab4a759a8 in BackendRun (port=0xaaaad5d76150, port=0xaaaad5d76150) at
./build/../src/backend/postmaster/postmaster.c:4517
#14 BackendStartup (port=0xaaaad5d76150) at ./build/../src/backend/postmaster/postmaster.c:4200
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#16 0x0000aaaab4a769d4 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
./build/../src/backend/postmaster/postmaster.c:1398
#17 0x0000aaaab480355c in main (argc=5, argv=0xaaaad5d16720) at ./build/../src/backend/main/main.c:228

Mon Jul  3 12:50:45 UTC 2023
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000aaaab4c1e11c in hash_search_with_hash_value (hashp=0xaaaad5dea900, keyPtr=0xfffff0173f5c,
keyPtr@entry=0xfffff0173f7c, hashvalue=1107843932, action=action@entry=HASH_FIND, foundPtr=foundPtr@entry=0x0) at
./build/../src/backend/utils/hash/dynahash.c:949
949     ./build/../src/backend/utils/hash/dynahash.c: No such file or directory.
#0  0x0000aaaab4c1e11c in hash_search_with_hash_value (hashp=0xaaaad5dea900, keyPtr=0xfffff0173f5c,
keyPtr@entry=0xfffff0173f7c, hashvalue=1107843932, action=action@entry=HASH_FIND, foundPtr=foundPtr@entry=0x0) at
./build/../src/backend/utils/hash/dynahash.c:949
#1  0x0000aaaab4c1e79c in hash_search (hashp=<optimized out>, keyPtr=keyPtr@entry=0xfffff0173f7c,
action=action@entry=HASH_FIND, foundPtr=foundPtr@entry=0x0) at ./build/../src/backend/utils/hash/dynahash.c:911
#2  0x0000aaaab4c07180 in RelationCacheInvalidateEntry (relationId=<optimized out>) at
./build/../src/backend/utils/cache/relcache.c:2820
#3  0x0000aaaab4bfcfe0 in LocalExecuteInvalidationMessage (msg=0xffffa033ab88) at
./build/../src/backend/utils/cache/inval.c:603
#4  0x0000aaaab4a91884 in ReorderBufferExecuteInvalidations (rb=0xffffa4ac5308 <malloc+160>, txn=0xfffff0174240,
txn=0xfffff0174240) at ./build/../src/backend/replication/logical/reorderbuffer.c:2187
#5  ReorderBufferCommit (rb=0xffffa4ac5308 <malloc+160>, xid=xid@entry=2741814901, commit_lsn=187650155969544,
end_lsn=<optimized out>, commit_time=commit_time@entry=741514150878208, origin_id=origin_id@entry=0,
origin_lsn=origin_lsn@entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1816
#6  0x0000aaaab4a869bc in DecodeCommit (xid=2741814901, parsed=0xfffff0174390, buf=0xfffff0174510, ctx=0xaaaad5e1df00)
at ./build/../src/backend/replication/logical/decode.c:654
#7  DecodeXactOp (ctx=ctx@entry=0xaaaad5e1df00, buf=0xfffff0174510, buf@entry=0xfffff0174570) at
./build/../src/backend/replication/logical/decode.c:249
#8  0x0000aaaab4a86ad4 in LogicalDecodingProcessRecord (ctx=0xaaaad5e1df00, record=0xaaaad5e1e198) at
./build/../src/backend/replication/logical/decode.c:117
#9  0x0000aaaab4a996ec in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2883
#10 0x0000aaaab4a9bbb0 in WalSndLoop (send_data=send_data@entry=0xaaaab4a99688 <XLogSendLogical>) at
./build/../src/backend/replication/walsender.c:2232
#11 0x0000aaaab4a9c674 in StartLogicalReplication (cmd=0xaaaad5e47f90) at
./build/../src/backend/replication/walsender.c:1134
#12 exec_replication_command (cmd_string=cmd_string@entry=0xaaaad5d1db00 "START_REPLICATION SLOT \"slot_name\" LOGICAL
1D6C/92965050 (\"proto_version\" '1', \"publication_names\" 'xxx')") at
./build/../src/backend/replication/walsender.c:1602
#13 0x0000aaaab4af0c08 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaad5d7aaf8, dbname=<optimized out>,
username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4289
#14 0x0000aaaab4a759a8 in BackendRun (port=0xaaaad5d76150, port=0xaaaad5d76150) at
./build/../src/backend/postmaster/postmaster.c:4517
#15 BackendStartup (port=0xaaaad5d76150) at ./build/../src/backend/postmaster/postmaster.c:4200
#16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#17 0x0000aaaab4a769d4 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
./build/../src/backend/postmaster/postmaster.c:1398
#18 0x0000aaaab480355c in main (argc=5, argv=0xaaaad5d16720) at ./build/../src/backend/main/main.c:228

Mon Jul  3 12:51:16 UTC 2023
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
hash_seq_search (status=status@entry=0xfffff0173f40) at ./build/../src/backend/utils/hash/dynahash.c:1439
1439    ./build/../src/backend/utils/hash/dynahash.c: No such file or directory.
#0  hash_seq_search (status=status@entry=0xfffff0173f40) at ./build/../src/backend/utils/hash/dynahash.c:1439
#1  0x0000ffffa1bc8714 in rel_sync_cache_publication_cb (arg=<optimized out>, cacheid=<optimized out>,
hashvalue=<optimized out>) at ./build/../src/backend/replication/pgoutput/pgoutput.c:665
#2  0x0000aaaab4bfcef4 in CallSyscacheCallbacks (cacheid=47, hashvalue=3011071378) at
./build/../src/backend/utils/cache/inval.c:1520
#3  0x0000aaaab4a91884 in ReorderBufferExecuteInvalidations (rb=0xffffa4ac5308 <malloc+160>, txn=0xfffff0174240,
txn=0xfffff0174240) at ./build/../src/backend/replication/logical/reorderbuffer.c:2187
#4  ReorderBufferCommit (rb=0xffffa4ac5308 <malloc+160>, xid=xid@entry=2741814901, commit_lsn=187650155969544,
end_lsn=<optimized out>, commit_time=commit_time@entry=741514150878208, origin_id=origin_id@entry=0,
origin_lsn=origin_lsn@entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1816
#5  0x0000aaaab4a869bc in DecodeCommit (xid=2741814901, parsed=0xfffff0174390, buf=0xfffff0174510, ctx=0xaaaad5e1df00)
at ./build/../src/backend/replication/logical/decode.c:654
#6  DecodeXactOp (ctx=ctx@entry=0xaaaad5e1df00, buf=0xfffff0174510, buf@entry=0xfffff0174570) at
./build/../src/backend/replication/logical/decode.c:249
#7  0x0000aaaab4a86ad4 in LogicalDecodingProcessRecord (ctx=0xaaaad5e1df00, record=0xaaaad5e1e198) at
./build/../src/backend/replication/logical/decode.c:117
#8  0x0000aaaab4a996ec in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2883
#9  0x0000aaaab4a9bbb0 in WalSndLoop (send_data=send_data@entry=0xaaaab4a99688 <XLogSendLogical>) at
./build/../src/backend/replication/walsender.c:2232
#10 0x0000aaaab4a9c674 in StartLogicalReplication (cmd=0xaaaad5e47f90) at
./build/../src/backend/replication/walsender.c:1134
#11 exec_replication_command (cmd_string=cmd_string@entry=0xaaaad5d1db00 "START_REPLICATION SLOT \"slot_name\" LOGICAL
1D6C/92965050 (\"proto_version\" '1', \"publication_names\" 'xxx')") at
./build/../src/backend/replication/walsender.c:1602
#12 0x0000aaaab4af0c08 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaad5d7aaf8, dbname=<optimized out>,
username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4289
#13 0x0000aaaab4a759a8 in BackendRun (port=0xaaaad5d76150, port=0xaaaad5d76150) at
./build/../src/backend/postmaster/postmaster.c:4517
#14 BackendStartup (port=0xaaaad5d76150) at ./build/../src/backend/postmaster/postmaster.c:4200
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#16 0x0000aaaab4a769d4 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
./build/../src/backend/postmaster/postmaster.c:1398
#17 0x0000aaaab480355c in main (argc=5, argv=0xaaaad5d16720) at ./build/../src/backend/main/main.c:228
#v-

Based on a suggestion from IRC, I tried "return 0" and continue in the gdb session.

The backtrace changed to:

#v+
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000aaaab4bfcf84 in LocalExecuteInvalidationMessage (msg=0xffff9ff33e18) at
./build/../src/backend/utils/cache/inval.c:578
578     ./build/../src/backend/utils/cache/inval.c: No such file or directory.
#0  0x0000aaaab4bfcf84 in LocalExecuteInvalidationMessage (msg=0xffff9ff33e18) at
./build/../src/backend/utils/cache/inval.c:578
#1  0x0000aaaab4a91884 in ReorderBufferExecuteInvalidations (rb=0xffffa4ac5308 <malloc+160>, txn=0xfffff0174240,
txn=0xfffff0174240) at ./build/../src/backend/replication/logical/reorderbuffer.c:2187
#2  ReorderBufferCommit (rb=0xffffa4ac5308 <malloc+160>, xid=xid@entry=2741814901, commit_lsn=187650155969544,
end_lsn=<optimized out>, commit_time=commit_time@entry=741514150878208, origin_id=origin_id@entry=0,
origin_lsn=origin_lsn@entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1816
#3  0x0000aaaab4a869bc in DecodeCommit (xid=2741814901, parsed=0xfffff0174390, buf=0xfffff0174510, ctx=0xaaaad5e1df00)
at ./build/../src/backend/replication/logical/decode.c:654
#4  DecodeXactOp (ctx=ctx@entry=0xaaaad5e1df00, buf=0xfffff0174510, buf@entry=0xfffff0174570) at
./build/../src/backend/replication/logical/decode.c:249
#5  0x0000aaaab4a86ad4 in LogicalDecodingProcessRecord (ctx=0xaaaad5e1df00, record=0xaaaad5e1e198) at
./build/../src/backend/replication/logical/decode.c:117
#6  0x0000aaaab4a996ec in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2883
#7  0x0000aaaab4a9bbb0 in WalSndLoop (send_data=send_data@entry=0xaaaab4a99688 <XLogSendLogical>) at
./build/../src/backend/replication/walsender.c:2232
#8  0x0000aaaab4a9c674 in StartLogicalReplication (cmd=0xaaaad5e47f90) at
./build/../src/backend/replication/walsender.c:1134
#9  exec_replication_command (cmd_string=cmd_string@entry=0xaaaad5d1db00 "START_REPLICATION SLOT
\"data_access_platform_cdc\" LOGICAL 1D6C/92965050 (\"proto_version\" '1', \"publication_names\" 'cdc')") at
./build/../src/backend/replication/walsender.c:1602
#10 0x0000aaaab4af0c08 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaad5d7aaf8, dbname=<optimized out>,
username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4289
#11 0x0000aaaab4a759a8 in BackendRun (port=0xaaaad5d76150, port=0xaaaad5d76150) at
./build/../src/backend/postmaster/postmaster.c:4517
#12 BackendStartup (port=0xaaaad5d76150) at ./build/../src/backend/postmaster/postmaster.c:4200
#13 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#14 0x0000aaaab4a769d4 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
./build/../src/backend/postmaster/postmaster.c:1398
#15 0x0000aaaab480355c in main (argc=5, argv=0xaaaad5d16720) at ./build/../src/backend/main/main.c:228
#v-

and then it went back to hash_seq_search  :(

Anything I can do about it?

Best regards,

depesz




Re: Backend handling replication slot stuck using 100% cpu, unkillable

From: shveta malik
Date:
On Mon, Jul 3, 2023 at 6:28 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Mon, Jul 03, 2023 at 01:36:32PM +0200, hubert depesz lubaczewski wrote:
> > Hi,
> > we are using debezium to get change data from Pg.
> ...
>
> and then it went back to hash_seq_search  :(
>
> Anything I can do about it?
>

Please check the email thread at [1]. Going by the stack trace you shared,
this problem looks exactly the same. I feel upgrading to PG14 will
probably solve this issue.

[1]: https://www.postgresql.org/message-id/flat/17716-1fe42e7b44fc2f25%40postgresql.org

thanks
Shveta



Re: Backend handling replication slot stuck using 100% cpu, unkillable

From: Tomas Vondra
Date:
On 7/3/23 14:58, hubert depesz lubaczewski wrote:
> ...
> 
> and then it went back to hash_seq_search  :(
>

So is it an infinite loop in ReorderBufferExecuteInvalidations, or is it
just the case that there are many invalidations? I can't really deduce
that from the backtraces.

How many invalidations does the transaction have? Should be enough to

  print txn->ninvalidations
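
For example, something like this (just a sketch - frame 3 matches
ReorderBufferExecuteInvalidations in the backtraces you posted, but the
number may differ; pick whichever frame has txn in scope):

  $ sudo gdb -p <pid-of-stuck-walsender>
  (gdb) bt
  (gdb) frame 3
  (gdb) print txn->ninvalidations
  (gdb) detach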

Also, is there anything interesting about the transaction? You know the
XID (2741814901) so maybe use pg_waldump to see what it did.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Backend handling replication slot stuck using 100% cpu, unkillable

From: hubert depesz lubaczewski
Date:
On Tue, Jul 04, 2023 at 08:56:53AM +0530, shveta malik wrote:
> Please check email thread [1]. Going by the stack trace shared by you,
> this problem looks exactly the same. I feel upgrading to PG14 will
> probably solve this issue.

Thanks a lot, will read, and hopefully it will provide some more
background/hope/info :)

Best regards,

depesz




Re: Backend handling replication slot stuck using 100% cpu, unkillable

From: hubert depesz lubaczewski
Date:
On Tue, Jul 04, 2023 at 01:30:21PM +0200, Tomas Vondra wrote:
> On 7/3/23 14:58, hubert depesz lubaczewski wrote:
> So is it an infinite loop in ReorderBufferExecuteInvalidations, or is it
> just the case that there are many invalidations? I can't really deduce
> that from the backtraces.
> 
> How many invalidations does the transaction have? Should be enough to
> 
>   print txn->ninvalidations
> 
> Also, is there anything interesting about the transaction? You know the
> XID (2741814901) so maybe use pg_waldump to see what it did.

I don't have the process anymore. Pg was restarted, and the app was
forced to not take data from this place.

I can run pg_waldump, no problem, but I'm not an expert on this. Which
WAL file(s) should I try to dump? And what should I look for (or should
I just provide the full output of the dump)?

Best regards,

depesz




Re: Backend handling replication slot stuck using 100% cpu, unkillable

From: Tomas Vondra
Date:

On 7/4/23 13:53, hubert depesz lubaczewski wrote:
> On Tue, Jul 04, 2023 at 01:30:21PM +0200, Tomas Vondra wrote:
>> On 7/3/23 14:58, hubert depesz lubaczewski wrote:
>> So is it an infinite loop in ReorderBufferExecuteInvalidations, or is it
>> just the case that there are many invalidations? I can't really deduce
>> that from the backtraces.
>>
>> How many invalidations does the transaction have? Should be enough to
>>
>>   print txn->ninvalidations
>>
>> Also, is there anything interesting about the transaction? You know the
>> XID (2741814901) so maybe use pg_waldump to see what it did.
> 
> I don't have the process anymore. Pg was restarted, and the app was
> forced to not take data from this place.
> 

OK. So the slot is still there, and we could try reading the data again
(hopefully getting stuck in the same place).

> I can run pg_waldump, no problem, but I'm not an expert on this. Which
> WAL file(s) should I try to dump? And what should I look for (or should
> I just provide the full output of the dump)?
> 

The backtrace has this:

  commit_lsn=187650155969544

and 187650155969544 should be LSN AAAA/B4E37C08, which maps to

select pg_walfile_name('AAAA/B4E37C08');
     pg_walfile_name
--------------------------
 000000010000AAAA000000B4

So maybe try dumping that WAL segment and see if the XID 2741814901 is
there. If yes, maybe share the whole dump, it's difficult to say what to
share without knowing what the transaction did.
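
Something like this (a sketch - adjust -p to wherever that segment actually
lives; --xid just limits the output to records of that transaction):

  pg_waldump -p /path/to/pg_wal --xid=2741814901 000000010000AAAA000000B4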

There's also the LSNs in pg_replication_slots - it might be interesting
to see a dump of all the WAL between restart_lsn and confirmed_flush_lsn,
i.e. the segments between:

  SELECT pg_walfile_name(restart_lsn),
         pg_walfile_name(confirmed_flush_lsn)
    FROM pg_replication_slots WHERE slot_name = 'slot_name';
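
and then dump that range, e.g. (again only a sketch - pg_waldump also accepts
LSN bounds directly, substitute the values returned by pg_replication_slots):

  pg_waldump -p /path/to/pg_wal -s <restart_lsn> -e <confirmed_flush_lsn> -x 2741814901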


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Backend handling replication slot stuck using 100% cpu, unkillable

From: hubert depesz lubaczewski
Date:
On Tue, Jul 04, 2023 at 03:04:58PM +0200, Tomas Vondra wrote:
> The backtrace has this:
> and 187650155969544 should be LSN AAAA/B4E37C08, which maps to
> select pg_walfile_name('AAAA/B4E37C08');
>      pg_walfile_name
> --------------------------
>  000000010000AAAA000000B4

We're not that far. Current WAL is 0000000100001D77000000AF

There was a command in there, though, that referred to WAL file
0000000100001D6C00000092.
I checked the pg_waldump output of this WAL, and found lots of DELETE and
NEW_CID records on 2 relations: pg_depend and pg_publication_rel.
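
Roughly like this (a sketch - run from the pg_wal directory; I just grepped
the output rather than using any filtering options):

#v+
=$ pg_waldump 0000000100001D6C00000092 | grep -E 'DELETE|NEW_CID' | less
#v-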

This kinda makes sense given that around that time there was a drop of
a replication slot, and earlier a drop of all tables from it.

Best regards,

depesz