Re: BUG #17716: walsender process hang while decoding 'DROP PUBLICATION' XLOG

Поиск
Список
Период
Сортировка
От shveta malik
Тема Re: BUG #17716: walsender process hang while decoding 'DROP PUBLICATION' XLOG
Дата
Msg-id CAJpy0uAnQHf05tGJ0jBtFz+L3R9vXmynT5kMSp+qYiOo=J=14g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: BUG #17716: walsender process hang while decoding 'DROP PUBLICATION' XLOG  (Bowen Shi <zxwsbg12138@gmail.com>)
Ответы Re: BUG #17716: walsender process hang while decoding 'DROP PUBLICATION' XLOG
Список pgsql-bugs

Hello,
The problem is not reproducing on HEAD but reproducing on 13.9

Problem Description:
On drop of large publication with 10,000 tables added to it, walsender hangs in
ReorderBufferCommit -->ReorderBufferExecuteInvalidations->....hash_seq_search()

Root Cause:
Drop publication has 10000 tuples (table entries) to be deleted from catalogs. So it creates invalidation msgs to invalidate various cache entries. Each table-entry deletion results in 4 invalidation msgs. So for complete drop publication operation, it adds to 40,000 invalidation msgs aprox (+few more for publication drop itself). These invalidation msgs are added to top-txn.

Later when Walsender tries to read Xlog of 'drop publication' operation, it tries to replay these invalidation msgs for each command-id increment. Command-id is incremented from 1 to 10,000. So for each command-id, it replays all 40,000 invalidation msgs. There is no way to identify that only 4 invalidation msgs belong to a particular command-id and thus all accumulative msgs are played on each command-id increment. This results in a hang like situation in the above stack. And thus insert executed (belonging to different publication) immediately after drop publication is not received on subscriber.

What fixes it on HEAD:
On later versions, there is a new mechanism introduced where-in after each command id increment, we log invalidations to be done for that command-id. Record of type XLOG_XACT_INVALIDATIONS is inserted in XLOG. For the case in hand, each such invalidation record has 4 invalidation msgs for respective command-id. So all 40,000 invalidation msgs are inserted to Xlog but in pieces i.e. 10,000 records for invalidations, each record representing invalidations for 1 particular command id and each having 4 invalidation msgs.
Later when walsender tries to replay, it replays only that particualr's command-id xlog-record after each command-id increment (in the order these were inserted). Thus we perform invalidation 10,000 times but for 4 msgs at once, instead of 40,000 msgs every time in the previous version.

These changes are committed through below commit Ids:

Full change:
commit c55040ccd017962b7b8d3fbcdc184aa90c722a21
 Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Date:   Thu Jul 23 08:19:07 2020 +0530

Trivial Fix:
commit 36e545cd051ffe58064c1f38faeb64f4feef4aec
  Author: Amit Kapila
Date:   Thu Nov 10 16:56:49 2022 +0530


I have not tried it yet in version 14, hopefully we will see it working there. I will try and update it soon.

thanks
Shveta

On Tue, Dec 13, 2022 at 5:43 PM Bowen Shi <zxwsbg12138@gmail.com> wrote:
The following step can help reproducing the problem:
1. Run "initinstance.sh" and get two instance, one publisher and one
subscriber, this script will create a subscription as well;
2. Run “createschema.sh” and create 1000 schemas, each schema has 100 tables;
3. Run “createpublication.sh” and create a huge publication, which
publishes 10000 tables;
4. Run “insert.sh” and insert exactly one tuple into each table
created in Step 2;
5. Run “insert into t values(1);” in publisher, and then we can found
it in subscriber;
6. Run “drop publication pub_t_large;” in publisher;
7. Run “insert into t values(1);” in publisher, and then we can not
find it in subscriber, cause the walsender has hanged in decoding Step
6’s XLOG;



PG Bug reporting form <noreply@postgresql.org> 于2022年12月13日周二 15:51写道:
>
> The following bug has been logged on the website:
>
> Bug reference:      17716
> Logged by:          bowenshi
> Email address:      zxwsbg12138@gmail.com
> PostgreSQL version: 13.9
> Operating system:   centos
> Description:
>
> Dears, I have found a problem while using logical decoding.  I have two
> publications, and the walsender process hang while decoding DROP PUBLICATION
> XLOG of another publication. The pstack looks like following:
>
> #0  0x00000000008b9c24 in hash_seq_search ()
> #1  0x00007fd14340590c in rel_sync_cache_publication_cb () from
> pgoutput.so
> #2  0x0000000000897a1a in CallSyscacheCallbacks ()
> #3  0x000000000073682e in ReorderBufferCommit ()
> #4  0x000000000072b988 in DecodeXactOp ()
> #5  0x000000000072bd52 in LogicalDecodingProcessRecord ()
> #6  0x000000000074d14f in XLogSendLogical ()
> #7  0x000000000074f562 in WalSndLoop ()
> #8  0x0000000000750322 in exec_replication_command ()
> #9  0x0000000000792301 in PostgresMain ()
> #10 0x000000000048854f in ServerLoop ()
> #11 0x000000000071b579 in PostmasterMain ()
> #12 0x000000000048948d in main ()
>
> When I user perl to analyze, it spend almost 99% time in `hash_seq_search`
> function.
>
> This problem can be reproduced , and I will send the shell script later.
>

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: BUG #17717: Regression in vacuumdb (15 is slower than 10/11 and possible memory issue)
Следующее
От: Robins Tharakan
Дата:
Сообщение: Re: BUG #17713: Assert with postgres_fdw in v12