Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
От | Peter Smith |
---|---|
Тема | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop |
Дата | |
Msg-id | CAHut+Psprtsa4o89wtNnKLxxwXeDKAX9nNsdghT1Pv63siz+AA@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop (Amit Kapila <amit.kapila16@gmail.com>) |
Ответы |
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
(Amit Kapila <amit.kapila16@gmail.com>)
|
Список | pgsql-bugs |
On Tue, Nov 17, 2020 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Yeah, this seems to be possible and this is the reason I mentioned > above to dig more into this case. Did you try it via some test case? I > think we can generate a test via debugger where after the tablesync > worker reaches CATCHUP state stop it via debugger, then we can perform > some large transaction on the same table which apply worker will skip > and tablesync worker will try to apply changes and should fail. Hello Amit. FYI - This email is just to confirm that your above idea for debugging the tablesync worker does work. I have made a small temporary patch to aid testing this: PSA. The patch just adds some more LOGs - it helps see what is going on. The patch also gives a 30 second opportunity to attach to the tablesync worker process. ---- So the necessary debugging steps are like this: 0. Start debugger - Later we will attach to the tablesync worker process so it is convenient to set all the breakpoint in advance (gdb) b LogicalRepSyncTableStart (gdb) b LogicalRepApplyLoop (gdb) b apply_dispatch (gdb) b process_syncing_tables_for_sync 1. Pub/Sub nodes: Create/Start Publisher and Subscriber nodes with same test table. 2. Pub node: Add some initial data to the table (to give tablesync init something to copy) 3. Pub node: CREATE PUBLISHER 4. Sub node: CREATE SUBSCRIBER - at this point the slot gets created and the tablesync worker and apply worker are launched - that patch causes the tablesync worker to sleep 30 seconds to give opportunity to attach to it in debugger 5. Debugger/tablesync: Attach to the tablesync worker - continue - breaks at LogicalRepSyncTableStart - step over all the initial COPY code to the end of the function (where apply worker has told it to CATCHUP) 6. Pub: While the tablesync worker is paused in debugger do some other operations on the original table - e.g. psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 'foo');" 7. Debugger/tablesync: Allow the tablesync worker to continue - tablesync worker will return from LogicalRepSyncTableStart and then into LogicalRepApplyLoop - if tablesync worker detects it is "behind" then the LogicalRepApplyLoop will call apply_dispatch to deal with the message resulting from the step 6 operation. 8. Debugger/tablesync: Some apply handlers (e.g. apply_handle_commit) if not the LogicalRepApplyLoop itself, will end up calling the process_syncing_tables. - which calls process_syncing_tables_for_sync - which will cause the tablesync worker to set SYNCDONE status just before killing itself. 9. Sub node: The apply worker wait loop will now proceed again tick/tick/tick every 1 second to process any incoming replication messages. --- Using the above technique, you can cause a tablesync worker to enter some of the normal "apply" handling which might otherwise be very hard to reproduce outside of the debugger. --- Here is an example of some of the logging for the above debugging scenario: 2020-11-17 10:46:11.325 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:46:12.328 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:46:12.328 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:46:13.337 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:46:13.337 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:46:14.340 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:46:14.340 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:46:15.155 AEDT [9841] LOG: !!>> tablesync worker: About to call LogicalRepSyncTableStart to do initial syncing 2020-11-17 10:46:15.343 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:46:15.343 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:46:15.343 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:46:15.343 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:46:16.349 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:46:16.349 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:46:17.351 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:46:17.351 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:46:18.353 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:46:18.353 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:46:18.650 AEDT [10734] LOG: logical decoding found consistent point at 0/1B6D5E0 2020-11-17 10:46:18.650 AEDT [10734] DETAIL: There are no running transactions. 2020-11-17 10:46:18.650 AEDT [10734] STATEMENT: CREATE_REPLICATION_SLOT "tap_sub_24599_sync_16385" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT 2020-11-17 10:46:18.658 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:46:18.658 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:46:18.744 AEDT [9841] LOG: !!>> tablesync worker: wait for CATCHUP state notification 2020-11-17 10:46:18.744 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:46:18.744 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 'foo');" INSERT 0 1 [postgres@CentOS7-x64 ~]$ 2020-11-17 10:47:39.269 AEDT [9841] LOG: !!>> tablesync worker: received CATCHUP state notification 2020-11-17 10:47:44.117 AEDT [9841] LOG: !!>> tablesync worker: Returned from LogicalRepSyncTableStart 2020-11-17 10:48:05.678 AEDT [10734] LOG: starting logical decoding for slot "tap_sub_24599_sync_16385" 2020-11-17 10:48:05.678 AEDT [10734] DETAIL: Streaming transactions committing after 0/1B6D618, reading WAL from 0/1B6D5E0. 2020-11-17 10:48:05.678 AEDT [10734] STATEMENT: START_REPLICATION SLOT "tap_sub_24599_sync_16385" LOGICAL 0/1B6D618 (proto_version '2', publication_names '"tap_pub"') 2020-11-17 10:48:05.678 AEDT [10734] LOG: logical decoding found consistent point at 0/1B6D5E0 2020-11-17 10:48:05.678 AEDT [10734] DETAIL: There are no running transactions. 2020-11-17 10:48:05.678 AEDT [10734] STATEMENT: START_REPLICATION SLOT "tap_sub_24599_sync_16385" LOGICAL 0/1B6D618 (proto_version '2', publication_names '"tap_pub"') 2020-11-17 10:48:09.258 AEDT [9841] LOG: !!>> tablesync worker: LogicalRepApplyLoop 2020-11-17 10:49:36.537 AEDT [9841] LOG: !!>> tablesync worker: called process_syncing_tables 2020-11-17 10:49:36.538 AEDT [9841] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has finished 2020-11-17 10:49:36.588 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:49:36.589 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:49:36.590 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:49:37.591 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:49:37.591 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:49:38.592 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:49:38.592 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:49:39.594 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:49:39.594 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables 2020-11-17 10:49:40.595 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-17 10:49:40.595 AEDT [9838] LOG: !!>> apply worker: called process_syncing_tables --- I have so far only been trying above with the non-streaming subscription, and TBH stepping through this startup state "dance" of the tablesync/apply workers is already causing more questions than answers for me. Anyway, I will raise any questions as separate emails to this one. BTW, do you think these tablesync discussions should be moved to hackers list? I think they are no longer related to the reported BUG of this current thread. Kind Regards, Peter Smith. Fujitsu Australia
Вложения
В списке pgsql-bugs по дате отправления:
Предыдущее
От: PG Bug reporting formДата:
Сообщение: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.
Следующее
От: Amit KapilaДата:
Сообщение: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop