Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Поиск
Список
Период
Сортировка
От Peter Smith
Тема Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Дата
Msg-id CAHut+Psprtsa4o89wtNnKLxxwXeDKAX9nNsdghT1Pv63siz+AA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-bugs
On Tue, Nov 17, 2020 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> Yeah, this seems to be possible and this is the reason I mentioned
> above to dig more into this case. Did you try it via some test case? I
> think we can generate a test via debugger where after the tablesync
> worker reaches CATCHUP state stop it via debugger, then we can perform
> some large transaction on the same table which apply worker will skip
> and tablesync worker will try to apply changes and should fail.

Hello Amit.

FYI - This email is just to confirm that your above idea for debugging
the tablesync worker does work.

I have made a small temporary patch to aid testing this: PSA.

The patch just adds some more LOGs - it helps see what is going on.
The patch also gives a 30 second opportunity to attach to the
tablesync worker process.


----

So the necessary debugging steps are like this:

0. Start debugger
- Later we will attach to the tablesync worker process so it is
convenient to set all the breakpoint in advance
(gdb) b LogicalRepSyncTableStart
(gdb) b LogicalRepApplyLoop
(gdb) b apply_dispatch
(gdb) b process_syncing_tables_for_sync

1. Pub/Sub nodes: Create/Start Publisher and Subscriber nodes with
same test table.

2. Pub node: Add some initial data to the table (to give tablesync
init something to copy)

3. Pub node: CREATE PUBLISHER

4. Sub node: CREATE SUBSCRIBER
- at this point the slot gets created and the tablesync worker and
apply worker are launched
- that patch causes the tablesync worker to sleep 30 seconds to give
opportunity to attach to it in debugger

5. Debugger/tablesync: Attach to the tablesync worker
- continue
- breaks at LogicalRepSyncTableStart
- step over all the initial COPY code to the end of the function
(where apply worker has told it to CATCHUP)

6. Pub: While the tablesync worker is paused in debugger do some other
operations on the original table
- e.g. psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 'foo');"

7. Debugger/tablesync: Allow the tablesync worker to continue
- tablesync worker will return from LogicalRepSyncTableStart and then
into LogicalRepApplyLoop
- if tablesync worker detects it is "behind" then the
LogicalRepApplyLoop will call apply_dispatch to deal with the message
resulting from the step 6 operation.

8. Debugger/tablesync: Some apply handlers (e.g. apply_handle_commit)
if not the LogicalRepApplyLoop itself, will end up calling the
process_syncing_tables.
- which calls process_syncing_tables_for_sync
- which will cause the tablesync worker to set SYNCDONE status just
before killing itself.

9. Sub node: The apply worker wait loop will now proceed again
tick/tick/tick every 1 second to process any incoming replication
messages.

---

Using the above technique, you can cause a tablesync worker to enter
some of the normal "apply" handling which might otherwise be very hard
to reproduce outside of the debugger.

---

Here is an example of some of the logging for the above debugging scenario:

2020-11-17 10:46:11.325 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:12.328 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:12.328 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:13.337 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:13.337 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:14.340 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:14.340 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:15.155 AEDT [9841] LOG:  !!>> tablesync worker: About
to call LogicalRepSyncTableStart to do initial syncing
2020-11-17 10:46:15.343 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:15.343 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:15.343 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:15.343 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:16.349 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:16.349 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:17.351 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:17.351 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:18.353 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:18.353 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:18.650 AEDT [10734] LOG:  logical decoding found
consistent point at 0/1B6D5E0
2020-11-17 10:46:18.650 AEDT [10734] DETAIL:  There are no running transactions.
2020-11-17 10:46:18.650 AEDT [10734] STATEMENT:
CREATE_REPLICATION_SLOT "tap_sub_24599_sync_16385" TEMPORARY LOGICAL
pgoutput USE_SNAPSHOT
2020-11-17 10:46:18.658 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:18.658 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:18.744 AEDT [9841] LOG:  !!>> tablesync worker: wait
for CATCHUP state notification
2020-11-17 10:46:18.744 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:18.744 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables

psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 'foo');"
INSERT 0 1

[postgres@CentOS7-x64 ~]$ 2020-11-17 10:47:39.269 AEDT [9841] LOG:
!!>> tablesync worker: received CATCHUP state notification
2020-11-17 10:47:44.117 AEDT [9841] LOG:  !!>> tablesync worker:
Returned from LogicalRepSyncTableStart
2020-11-17 10:48:05.678 AEDT [10734] LOG:  starting logical decoding
for slot "tap_sub_24599_sync_16385"
2020-11-17 10:48:05.678 AEDT [10734] DETAIL:  Streaming transactions
committing after 0/1B6D618, reading WAL from 0/1B6D5E0.
2020-11-17 10:48:05.678 AEDT [10734] STATEMENT:  START_REPLICATION
SLOT "tap_sub_24599_sync_16385" LOGICAL 0/1B6D618 (proto_version '2',
publication_names '"tap_pub"')
2020-11-17 10:48:05.678 AEDT [10734] LOG:  logical decoding found
consistent point at 0/1B6D5E0
2020-11-17 10:48:05.678 AEDT [10734] DETAIL:  There are no running transactions.
2020-11-17 10:48:05.678 AEDT [10734] STATEMENT:  START_REPLICATION
SLOT "tap_sub_24599_sync_16385" LOGICAL 0/1B6D618 (proto_version '2',
publication_names '"tap_pub"')
2020-11-17 10:48:09.258 AEDT [9841] LOG:  !!>> tablesync worker:
LogicalRepApplyLoop
2020-11-17 10:49:36.537 AEDT [9841] LOG:  !!>> tablesync worker:
called process_syncing_tables
2020-11-17 10:49:36.538 AEDT [9841] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has finished
2020-11-17 10:49:36.588 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:36.589 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:36.590 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:37.591 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:37.591 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:38.592 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:38.592 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:39.594 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:39.594 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:40.595 AEDT [9838] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:40.595 AEDT [9838] LOG:  !!>> apply worker: called
process_syncing_tables

---

I have so far only been trying above with the non-streaming
subscription, and TBH stepping through this startup state "dance" of
the tablesync/apply workers is already causing more questions than
answers for me. Anyway, I will raise any questions as separate emails
to this one.

BTW, do you think these tablesync discussions should be moved to
hackers list? I think they are no longer related to the reported BUG
of this current thread.

Kind Regards,
Peter Smith.
Fujitsu Australia

Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: PG Bug reporting form
Дата:
Сообщение: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop