Re: Synchronizing slots from primary to standby
| From | Masahiko Sawada |
|---|---|
| Subject | Re: Synchronizing slots from primary to standby |
| Date | |
| Msg-id | CAD21AoCJJ86hyhH6C=udcLNnpbXA3uTVsjQBFi-887+niEpJ+g@mail.gmail.com |
| In response to | Re: Synchronizing slots from primary to standby (Amit Kapila <amit.kapila16@gmail.com>) |
| Responses | Re: Synchronizing slots from primary to standby; RE: Synchronizing slots from primary to standby |
| List | pgsql-hackers |
On Wed, Apr 3, 2024 at 7:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > > I quickly looked at v8, and have a nit, rest all looks good.
> > >
> > > + if (DecodingContextReady(ctx) && found_consistent_snapshot)
> > > + *found_consistent_snapshot = true;
> > >
> > > Can the found_consistent_snapshot be checked first to help avoid the
> > > function call DecodingContextReady() for pg_replication_slot_advance
> > > callers?
> > >
> >
> > Okay, changed. Additionally, I have updated the comments and commit
> > message. I'll push this patch after some more testing.
> >
>
> Pushed!
While testing this change, I realized that the server logs can be
flooded with the following logical decoding messages, which are
written every 200 ms:
2024-04-04 16:15:19.270 JST [3838739] LOG: starting logical decoding for slot "test_sub"
2024-04-04 16:15:19.270 JST [3838739] DETAIL: Streaming transactions committing after 0/50006F48, reading WAL from 0/50006F10.
2024-04-04 16:15:19.270 JST [3838739] LOG: logical decoding found consistent point at 0/50006F10
2024-04-04 16:15:19.270 JST [3838739] DETAIL: There are no running transactions.
2024-04-04 16:15:19.477 JST [3838739] LOG: starting logical decoding for slot "test_sub"
2024-04-04 16:15:19.477 JST [3838739] DETAIL: Streaming transactions committing after 0/50006F48, reading WAL from 0/50006F10.
2024-04-04 16:15:19.477 JST [3838739] LOG: logical decoding found consistent point at 0/50006F10
2024-04-04 16:15:19.477 JST [3838739] DETAIL: There are no running transactions.
For example, I could reproduce it with the following steps:
1. create the primary and start.
2. run "pgbench -i -s 100" on the primary.
3. run pg_basebackup to create the standby.
4. configure slotsync setup on the standby and start.
5. create a publication for all tables on the primary.
6. create the subscriber and start.
7. run "pgbench -i -Idtpf" on the subscriber.
8. create a subscription on the subscriber (initial data copy will start).
The logical decoding logs were written every 200 ms during the initial
data synchronization.
Looking at the new changes for update_local_synced_slot():
    if (remote_slot->confirmed_lsn != slot->data.confirmed_flush ||
        remote_slot->restart_lsn != slot->data.restart_lsn ||
        remote_slot->catalog_xmin != slot->data.catalog_xmin)
    {
        /*
         * We can't directly copy the remote slot's LSN or xmin unless there
         * exists a consistent snapshot at that point. Otherwise, after
         * promotion, the slots may not reach a consistent point before the
         * confirmed_flush_lsn which can lead to a data loss. To avoid data
         * loss, we let slot machinery advance the slot which ensures that
         * snapbuilder/slot statuses are updated properly.
         */
        if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
        {
            /*
             * Update the slot info directly if there is a serialized snapshot
             * at the restart_lsn, as the slot can quickly reach consistency
             * at restart_lsn by restoring the snapshot.
             */
            SpinLockAcquire(&slot->mutex);
            slot->data.restart_lsn = remote_slot->restart_lsn;
            slot->data.confirmed_flush = remote_slot->confirmed_lsn;
            slot->data.catalog_xmin = remote_slot->catalog_xmin;
            slot->effective_catalog_xmin = remote_slot->catalog_xmin;
            SpinLockRelease(&slot->mutex);

            if (found_consistent_snapshot)
                *found_consistent_snapshot = true;
        }
        else
        {
            LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
                                                found_consistent_snapshot);
        }

        ReplicationSlotsComputeRequiredXmin(false);
        ReplicationSlotsComputeRequiredLSN();

        slot_updated = true;
We call LogicalSlotAdvanceAndCheckSnapState() if any of confirmed_lsn,
restart_lsn, and catalog_xmin differs between the remote slot and the
local slot. In my test case, during the initial data sync, only
catalog_xmin was different and there was no serialized snapshot at
restart_lsn, so the slotsync worker called
LogicalSlotAdvanceAndCheckSnapState(). However, no slot properties
were changed by that call, and yet slot_updated was set to true. As a
result, the worker starts the next slot synchronization after 200 ms.
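To make the effect of slot_updated on the polling interval concrete, here is a toy model of the worker's nap-time decision. It is only a sketch, not the slotsync code: the function and macro names are hypothetical, the 200 ms value is taken from the log timestamps above, and the doubling back-off with a 30 s cap is an assumption made for illustration.

```c
#include <stdbool.h>

/* Hypothetical constants: 200 ms matches the interval seen in the logs;
 * the 30 s cap and the doubling back-off are assumptions. */
#define MIN_NAPTIME_MS 200L
#define MAX_NAPTIME_MS 30000L

/*
 * Return the next nap time, given the current one and whether any slot
 * was reported as updated during the last synchronization cycle.
 */
static long
next_naptime_ms(long current_ms, bool some_slot_updated)
{
    if (some_slot_updated)
        return MIN_NAPTIME_MS;      /* activity reported: poll again soon */

    if (current_ms * 2 > MAX_NAPTIME_MS)
        return MAX_NAPTIME_MS;      /* idle: cap the back-off */

    return current_ms * 2;          /* idle: back off */
}
```

Under this model, a cycle that reports slot_updated = true without changing anything keeps the worker at the minimum interval indefinitely, which matches the repeated log entries.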
It seems to me that we can skip calling
LogicalSlotAdvanceAndCheckSnapState() at least when the remote and
local have the same restart_lsn and confirmed_lsn.
I'm not sure whether there are other scenarios, but is it worth fixing this symptom?
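The suggested check could look roughly like the following. This is a minimal standalone sketch, not a patch: SlotPos and slot_needs_advance() are hypothetical names, and the typedefs are stand-ins for the PostgreSQL types so that the snippet compiles on its own.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* stand-in for PostgreSQL's type */
typedef uint32_t TransactionId;     /* stand-in for PostgreSQL's type */

/* Hypothetical, trimmed-down view of the three fields being compared. */
typedef struct SlotPos
{
    XLogRecPtr    restart_lsn;
    XLogRecPtr    confirmed_lsn;
    TransactionId catalog_xmin;
} SlotPos;

/*
 * Decide whether advancing the local slot is worth attempting.
 * A difference in catalog_xmin alone does not count, because in the
 * scenario described above the advance changes nothing in that case.
 */
static bool
slot_needs_advance(const SlotPos *remote, const SlotPos *local)
{
    return remote->restart_lsn != local->restart_lsn ||
           remote->confirmed_lsn != local->confirmed_lsn;
}
```

With such a check, the case where only catalog_xmin differs would neither call LogicalSlotAdvanceAndCheckSnapState() nor report the slot as updated. Whether catalog_xmin needs separate handling in that case is left open here.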
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com