Re: Synchronizing slots from primary to standby
From | Masahiko Sawada |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CAD21AoCJJ86hyhH6C=udcLNnpbXA3uTVsjQBFi-887+niEpJ+g@mail.gmail.com |
In response to | Re: Synchronizing slots from primary to standby (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Synchronizing slots from primary to standby (Amit Kapila <amit.kapila16@gmail.com>)
RE: Synchronizing slots from primary to standby ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
List | pgsql-hackers |
On Wed, Apr 3, 2024 at 7:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > > I quickly looked at v8, and have a nit, rest all looks good.
> > >
> > > + if (DecodingContextReady(ctx) && found_consistent_snapshot)
> > > + *found_consistent_snapshot = true;
> > >
> > > Can the found_consistent_snapshot be checked first to help avoid the
> > > function call DecodingContextReady() for pg_replication_slot_advance
> > > callers?
> > >
> >
> > Okay, changed. Additionally, I have updated the comments and commit
> > message. I'll push this patch after some more testing.
> >
>
> Pushed!

While testing this change, I realized that it could happen that the
server logs are flooded with the following logical decoding logs that
are written every 200 ms:

2024-04-04 16:15:19.270 JST [3838739] LOG: starting logical decoding for slot "test_sub"
2024-04-04 16:15:19.270 JST [3838739] DETAIL: Streaming transactions committing after 0/50006F48, reading WAL from 0/50006F10.
2024-04-04 16:15:19.270 JST [3838739] LOG: logical decoding found consistent point at 0/50006F10
2024-04-04 16:15:19.270 JST [3838739] DETAIL: There are no running transactions.
2024-04-04 16:15:19.477 JST [3838739] LOG: starting logical decoding for slot "test_sub"
2024-04-04 16:15:19.477 JST [3838739] DETAIL: Streaming transactions committing after 0/50006F48, reading WAL from 0/50006F10.
2024-04-04 16:15:19.477 JST [3838739] LOG: logical decoding found consistent point at 0/50006F10
2024-04-04 16:15:19.477 JST [3838739] DETAIL: There are no running transactions.

For example, I could reproduce it with the following steps:

1. create the primary and start.
2. run "pgbench -i -s 100" on the primary.
3. run pg_basebackup to create the standby.
4. configure slotsync setup on the standby and start.
5. create a publication for all tables on the primary.
6. create the subscriber and start.
7. run "pgbench -i -Idtpf" on the subscriber.
8. create a subscription on the subscriber (initial data copy will start).

The logical decoding logs were written every 200 ms during the initial
data synchronization.

Looking at the new changes for update_local_synced_slot():

    if (remote_slot->confirmed_lsn != slot->data.confirmed_flush ||
        remote_slot->restart_lsn != slot->data.restart_lsn ||
        remote_slot->catalog_xmin != slot->data.catalog_xmin)
    {
        /*
         * We can't directly copy the remote slot's LSN or xmin unless there
         * exists a consistent snapshot at that point. Otherwise, after
         * promotion, the slots may not reach a consistent point before the
         * confirmed_flush_lsn which can lead to a data loss. To avoid data
         * loss, we let slot machinery advance the slot which ensures that
         * snapbuilder/slot statuses are updated properly.
         */
        if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
        {
            /*
             * Update the slot info directly if there is a serialized snapshot
             * at the restart_lsn, as the slot can quickly reach consistency
             * at restart_lsn by restoring the snapshot.
             */
            SpinLockAcquire(&slot->mutex);
            slot->data.restart_lsn = remote_slot->restart_lsn;
            slot->data.confirmed_flush = remote_slot->confirmed_lsn;
            slot->data.catalog_xmin = remote_slot->catalog_xmin;
            slot->effective_catalog_xmin = remote_slot->catalog_xmin;
            SpinLockRelease(&slot->mutex);

            if (found_consistent_snapshot)
                *found_consistent_snapshot = true;
        }
        else
        {
            LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
                                                found_consistent_snapshot);
        }

        ReplicationSlotsComputeRequiredXmin(false);
        ReplicationSlotsComputeRequiredLSN();

        slot_updated = true;

We call LogicalSlotAdvanceAndCheckSnapState() if one of confirmed_lsn,
restart_lsn, and catalog_xmin is different between the remote slot and
the local slot.
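
To make the branch structure of the quoted code easier to follow, here is a minimal standalone sketch of the decision it makes. It is not PostgreSQL code: SketchSlot, SketchAction and sketch_choose_action() are hypothetical names, the LSN and xid types are plain integers standing in for XLogRecPtr and TransactionId, and the snapshot_exists flag stands in for the result of SnapBuildSnapshotExists(remote_slot->restart_lsn).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the three slot fields compared in the quoted code. */
typedef struct SketchSlot
{
    uint64_t restart_lsn;    /* stand-in for an XLogRecPtr */
    uint64_t confirmed_lsn;  /* stand-in for an XLogRecPtr */
    uint32_t catalog_xmin;   /* stand-in for a TransactionId */
} SketchSlot;

typedef enum SketchAction
{
    SKETCH_NO_UPDATE,      /* all three fields already match */
    SKETCH_COPY_DIRECTLY,  /* serialized snapshot exists at remote restart_lsn */
    SKETCH_ADVANCE         /* let the slot machinery advance the slot */
} SketchAction;

/*
 * Mirror the shape of the quoted condition: any difference in the three
 * fields enters the update path, and the snapshot check then selects
 * between a direct copy and an advance.
 */
SketchAction
sketch_choose_action(const SketchSlot *remote, const SketchSlot *local,
                     bool snapshot_exists)
{
    assert(remote != NULL && local != NULL);

    if (remote->confirmed_lsn == local->confirmed_lsn &&
        remote->restart_lsn == local->restart_lsn &&
        remote->catalog_xmin == local->catalog_xmin)
        return SKETCH_NO_UPDATE;

    return snapshot_exists ? SKETCH_COPY_DIRECTLY : SKETCH_ADVANCE;
}
```

In this model, a remote slot that differs from the local one only in catalog_xmin, with no serialized snapshot available, still yields SKETCH_ADVANCE.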
In my test case, while the initial sync was in progress, only
catalog_xmin was different and there was no serialized snapshot at
restart_lsn, so the slotsync worker called
LogicalSlotAdvanceAndCheckSnapState(). However, no slot properties were
changed by that call, and slot_updated was still set to true. So it
starts the next slot synchronization after 200 ms.

It seems to me that we can skip calling
LogicalSlotAdvanceAndCheckSnapState() at least when the remote and
local slots have the same restart_lsn and confirmed_lsn.

I'm not sure whether there are other scenarios, but is it worth fixing
this symptom?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
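
The skip suggested above could look roughly like the following standalone sketch. It only models the idea and is not a patch: SlotPos and sketch_should_advance() are hypothetical names, plain integers stand in for XLogRecPtr and TransactionId, and how a catalog_xmin-only difference should be handled is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the slot fields that the sync code compares. */
typedef struct SlotPos
{
    uint64_t restart_lsn;    /* stand-in for an XLogRecPtr */
    uint64_t confirmed_lsn;  /* stand-in for an XLogRecPtr */
    uint32_t catalog_xmin;   /* stand-in for a TransactionId */
} SlotPos;

/*
 * Return true only when the advance path has an LSN to move to, that is,
 * when restart_lsn or confirmed_lsn differs between remote and local.
 * A difference in catalog_xmin alone returns false, which is the case
 * that produced the repeated log messages in the test.
 */
bool
sketch_should_advance(const SlotPos *remote, const SlotPos *local)
{
    assert(remote != NULL && local != NULL);

    return remote->restart_lsn != local->restart_lsn ||
           remote->confirmed_lsn != local->confirmed_lsn;
}
```

A caller in this model would consult sketch_should_advance() before taking the advance path and would leave slot_updated unset when it returns false, so that a no-op cycle does not trigger another synchronization 200 ms later.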