Re: BUG #17580: use pg_terminate_backend to terminate a wal sender process may wait a long time

Поиск
Список
Период
Сортировка
От 巨鲸
Тема Re: BUG #17580: use pg_terminate_backend to terminate a wal sender process may wait a long time
Дата
Msg-id tencent_9E39E51A04CA7F2A1837F5D26734A79A2605@qq.com
обсуждение исходный текст
Ответ на Re: BUG #17580: use pg_terminate_backend to terminate a wal sender process may wait a long time  (Masahiko Sawada <sawada.mshk@gmail.com>)
Список pgsql-bugs
Hi Masahiko,
I think maybe you should use filter-tables to filter the test table, likes:
       filter-tables="public.test666"

Then it will not go to WalSndWriteData and loop a long time in ReorderBufferCommit, and the bug may be reproduced.


Regards,

--
Whale Song


------------------ Original ------------------
From:  "Masahiko Sawada";<sawada.mshk@gmail.com>;
Send time: Friday, Aug 12, 2022 11:28 AM
To: "巨鲸"<13207961@qq.com>; "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject:  Re: BUG #17580: use pg_terminate_backend to terminate a wal sender process may wait a long time

On Wed, Aug 10, 2022 at 5:50 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      17580
> Logged by:          whale song
> Email address:      13207961@qq.com
> PostgreSQL version: 10.5
> Operating system:   centos
> Description:
>
> 1) Create a table for test
> create table test666(id int primary key, number int);
>
> 2) Use pg_recvlogical to subscribe
> 2.1) Create a replicate slot
> pg_recvlogical -h 9.223.28.176 -p 11002 -d test -U tbase --slot
> wal2json_pg_test02 --create-slot -P wal2json
> 2.2) Start to subscribe, and use filter-tables to filter test table
> test666
> pg_recvlogical -h 9.223.28.176 -p 11002 -d test -U tbase --slot
> wal2json_pg_test02 --start -o pretty-print=1 -o format-version=2 -o
> filter-tables="public.test666"  -f -
>
> 3) Insert a lot of data to the table in one transaction
> insert into test666 values (generate_series(1000000001,2000000000),
> generate_series(1000000001,2000000000));
> It may cause a long time, maybe 50 minutes.
>
> 4)  Use pg_replication_slots to get wal sender process pid(active_pid)
> select *,pg_current_wal_lsn() from pg_replication_slots ;
>
> 5) When step 3 finish, use pstack to watch wal sender process stack, when it
> is loop in ReorderBufferCommit, kill pg_recvlogical which start in step
> 2.2
>  pstack 80743
> #0 0x0000000000c400d9 in ResourceArrayAdd (resarr=0x1576970,
> value=139941599240896) at resowner.c:359
> #1 0x0000000000c3fef4 in ResourceArrayEnlarge (resarr=0x1576970) at
> resowner.c:252
> #2 0x0000000000c411f4 in ResourceOwnerEnlargeRelationRefs (owner=0x15768f0)
> at resowner.c:1129
> #3 0x0000000000be0899 in RelationIncrementReferenceCount
> (rel=0x7f46b14fe2c0) at relcache.c:2589
> #4 0x0000000000be05da in RelationIdGetRelation (relationId=1177785) at
> relcache.c:2451
> #5 0x00007f4e0d649b9e in pg_decode_write_tuple (ctx=0x13fc5a8,
> relation=0x7f46b14fcf38, tuple=0x315b910, kind=PGOUTPUTJSON_IDENTITY) at
> wal2json.c:1503
> #6 0x00007f4e0d64a514 in pg_decode_write_change (ctx=0x13fc5a8,
> txn=0x15957a8, relation=0x7f46b14fcf38, change=0x33bc9f8) at
> wal2json.c:1747
> #7 0x00007f4e0d64a9db in pg_decode_change_v2 (ctx=0x13fc5a8, txn=0x15957a8,
> relation=0x7f46b14fcf38, change=0x33bc9f8) at wal2json.c:1865
> #8 0x00007f4e0d648b2b in pg_decode_change (ctx=0x13fc5a8, txn=0x15957a8,
> relation=0x7f46b14fcf38, change=0x33bc9f8) at wal2json.c:1098
> #9 0x00000000009cf3d1 in change_cb_wrapper (cache=0x1410fb8, txn=0x15957a8,
> relation=0x7f46b14fcf38, change=0x33bc9f8) at logical.c:730
> #10 0x00000000009d75df in ReorderBufferCommit (rb=0x1410fb8, xid=174028801,
> commit_lsn=44579135148872, end_lsn=44579135148936,
> commit_time=712222669188947, origin_id=0, origin_lsn=0) at
> reorderbuffer.c:1445
> #11 0x00000000009cb15b in DecodeCommit (ctx=0x13fc5a8, buf=0x7fff2574d600,
> parsed=0x7fff2574d530, xid=174028801) at decode.c:713
> #12 0x00000000009ca76b in DecodeXactOp (ctx=0x13fc5a8, buf=0x7fff2574d600)
> at decode.c:335
> #13 0x00000000009ca401 in LogicalDecodingProcessRecord (ctx=0x13fc5a8,
> record=0x13fc880) at decode.c:175
> #14 0x00000000009ead45 in XLogSendLogical () at walsender.c:3152
> #15 0x00000000009ea09f in WalSndLoop (send_data=0x9eaca9 <XLogSendLogical>)
> at walsender.c:2520
> #16 0x00000000009e8aed in StartLogicalReplication (cmd=0x151bb48) at
> walsender.c:1445
> #17 0x00000000009e944e in exec_replication_command (cmd_string=0x1520408
> "START_REPLICATION SLOT subs_f524p217b7_dn0001 LOGICAL 285C/F8FFEA50 (
> \"format-version\" '2' )") at walsender.c:1887
> #18 0x0000000000a71ed8 in PostgresMain (argc=1, argv=0x1419f58,
> dbname=0x1419f28 "prd_jxcard_lingwang", username=0x1419f08 "mingchen_read")
> at postgres.c:6118
> #19 0x00000000009b0faa in BackendRun (port=0x14135b0) at postmaster.c:5071
> #20 0x00000000009b0753 in BackendStartup (port=0x14135b0) at
> postmaster.c:4743
> #21 0x00000000009ac3ae in ServerLoop () at postmaster.c:1995
> #22 0x00000000009aba35 in PostmasterMain (argc=5, argv=0x13dc770) at
> postmaster.c:1603
> #23 0x0000000000878c44 in main (argc=5, argv=0x13dc770) at main.c:233
>
> 6) Use pg_replication_slots to query the replication slots infomation, the
> active_pid is still active. Use pg_terminate_backend to terminate
> active_pid, it returns true, but the active_pid is still active.
>
> 7) The active_pid wal sender process exit after a long time waiting?
>

Hmm, I could not reproduce this behavior in my environment. As we have
CHECK_FOR_INTERRUPTS in WalSndWriteData() and the walsender exits if
it fails to flush, I think the walsender process should exit if it
either receives SIGTERM or failed to send the data. Did you see the
log "LOG:  could not send data to client: Broken pipe" after
terminating pg_recvlogical?

If it's reproducible in your environment, could you get the stack
trace of the walsender process that keeps running? and please try with
the latest version, 10.22.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: [PATCH] Fix segfault calling PQflush on invalid connection
Следующее
От: Peter Smith
Дата:
Сообщение: Re: No-op updates with partitioning and logical replication started failing in version 13