RE: Synchronizing slots from primary to standby

Поиск
Список
Период
Сортировка
От Zhijie Hou (Fujitsu)
Тема RE: Synchronizing slots from primary to standby
Дата
Msg-id OS0PR01MB571684A9D81F47CD1361A26B945E2@OS0PR01MB5716.jpnprd01.prod.outlook.com
обсуждение исходный текст
Ответ на Re: Synchronizing slots from primary to standby  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers
On Thursday, February 29, 2024 5:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Feb 29, 2024 at 8:29 AM Peter Smith <smithpb2250@gmail.com>
> wrote:
> >
> > On Wed, Feb 28, 2024 at 1:23 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, February 27, 2024 3:18 PM Peter Smith
> <smithpb2250@gmail.com> wrote:
> > ...
> > > > 20.
> > > > +#
> > > > +# | ----> standby1 (primary_slot_name = sb1_slot) # | ----> standby2
> > > > +(primary_slot_name = sb2_slot) # primary ----- | # | ----> subscriber1
> > > > +(failover = true) # | ----> subscriber2 (failover = false)
> > > >
> > > > In the diagram, the "--->" means a mixture of physical standbys and
> logical
> > > > pub/sub replication. Maybe it can be a bit clearer?
> > > >
> > > > SUGGESTION
> > > >
> > > > # primary (publisher)
> > > > #
> > > > #     (physical standbys)
> > > > #     | ----> standby1 (primary_slot_name = sb1_slot)
> > > > #     | ----> standby2 (primary_slot_name = sb2_slot)
> > > > #
> > > > #     (logical replication)
> > > > #     | ----> subscriber1 (failover = true, slot_name = lsub1_slot)
> > > > #     | ----> subscriber2 (failover = false, slot_name = lsub2_slot)
> > > >
> > >
> > > I think one can distinguish it based on the 'standby' and 'subscriber' as well,
> because
> > > 'standby' normally refer to physical standby while the other refer to logical.
> > >
> 
> I think Peter's suggestion will make the setup clear.

Changed.

> 
> >
> > Ok, but shouldn't it at least include info about the logical slot
> > names associated with the subscribers (slot_name = lsub1_slot,
> > slot_name = lsub2_slot) like suggested above?
> >
> > ======
> >
> > Here are some more review comments for v100-0001
> >
> > ======
> > doc/src/sgml/config.sgml
> >
> > 1.
> > +       <para>
> > +        Lists the streaming replication standby server slot names that
> logical
> > +        WAL sender processes will wait for. Logical WAL sender processes
> will
> > +        send decoded changes to plugins only after the specified
> replication
> > +        slots confirm receiving WAL. This guarantees that logical replication
> > +        slots with failover enabled do not consume changes until those
> changes
> > +        are received and flushed to corresponding physical standbys. If a
> > +        logical replication connection is meant to switch to a physical
> standby
> > +        after the standby is promoted, the physical replication slot for the
> > +        standby should be listed here. Note that logical replication will not
> > +        proceed if the slots specified in the standby_slot_names do
> > not exist or
> > +        are invalidated.
> > +       </para>
> >
> > Is that note ("Note that logical replication will not proceed if the
> > slots specified in the standby_slot_names do not exist or are
> > invalidated") meant only for subscriptions marked for 'failover' or
> > any subscription? Maybe wording can be modified to help clarify it?
> >
> 
> I think it is implicit that here we are talking about failover slots.
> I think clarifying again the same could be repetitive considering the
> previous sentence: "This guarantees that logical replication slots
> with failover enabled do not consume .." have mentioned it.
> 
> > ======
> > src/backend/replication/slot.c
> >
> > 2.
> > +/*
> > + * A helper function to validate slots specified in GUC standby_slot_names.
> > + */
> > +static bool
> > +validate_standby_slots(char **newval)
> > +{
> > + char    *rawname;
> > + List    *elemlist;
> > + bool ok;
> > +
> > + /* Need a modifiable copy of string */
> > + rawname = pstrdup(*newval);
> > +
> > + /* Verify syntax and parse string into a list of identifiers */
> > + ok = SplitIdentifierString(rawname, ',', &elemlist);
> > +
> > + if (!ok)
> > + {
> > + GUC_check_errdetail("List syntax is invalid.");
> > + }
> > +
> > + /*
> > + * If the replication slots' data have been initialized, verify if the
> > + * specified slots exist and are logical slots.
> > + */
> > + else if (ReplicationSlotCtl)
> > + {
> > + foreach_ptr(char, name, elemlist)
> > + {
> > + ReplicationSlot *slot;
> > +
> > + slot = SearchNamedReplicationSlot(name, true);
> > +
> > + if (!slot)
> > + {
> > + GUC_check_errdetail("replication slot \"%s\" does not exist",
> > + name);
> > + ok = false;
> > + break;
> > + }
> > +
> > + if (!SlotIsPhysical(slot))
> > + {
> > + GUC_check_errdetail("\"%s\" is not a physical replication slot",
> > + name);
> > + ok = false;
> > + break;
> > + }
> > + }
> > + }
> > +
> > + pfree(rawname);
> > + list_free(elemlist);
> > + return ok;
> > +}
> >
> > 2a.
> > I didn't mention this previously because I thought this function was
> > not going to change anymore, but since Bertrand suggested some changes
> > [1], I will say IMO the { } are fine here for the single statement,
> > but I think it will be better to rearrange this code to be like below.
> > Having a 2nd NOP 'else' gives a much better place where you can put
> > your ReplicationSlotCtl comment.
> >
> > if (!ok)
> > {
> >   GUC_check_errdetail("List syntax is invalid.");
> > }
> > else if (!ReplicationSlotCtl)
> > {
> >   <Insert big comment here about why it is OK to skip when
> > ReplicationSlotCtl is NULL>
> > }
> > else
> > {
> >   foreach_ptr ...
> > }
> >
> 
> +1. This will make the code and reasoning to skip clear.

Changed.

> 
> Few additional comments on the latest patch:
> =================================
> 1.
>  static XLogRecPtr
>  WalSndWaitForWal(XLogRecPtr loc)
> {
> ...
> + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr &&
> + (!replication_active || StandbyConfirmedFlush(loc, WARNING)))
> + {
> + /*
> + * Fast path to avoid acquiring the spinlock in case we already know
> + * we have enough WAL available and all the standby servers have
> + * confirmed receipt of WAL up to RecentFlushPtr. This is particularly
> + * interesting if we're far behind.
> + */
>   return RecentFlushPtr;
> + }
> ...
> ...
> + * Wait for WALs to be flushed to disk only if the postmaster has not
> + * asked us to stop.
> + */
> + if (loc > RecentFlushPtr && !got_STOPPING)
> + wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;
> +
> + /*
> + * Check if the standby slots have caught up to the flushed position.
> + * It is good to wait up to RecentFlushPtr and then let it send the
> + * changes to logical subscribers one by one which are already covered
> + * in RecentFlushPtr without needing to wait on every change for
> + * standby confirmation. Note that after receiving the shutdown signal,
> + * an ERROR is reported if any slots are dropped, invalidated, or
> + * inactive. This measure is taken to prevent the walsender from
> + * waiting indefinitely.
> + */
> + else if (replication_active &&
> + !StandbyConfirmedFlush(RecentFlushPtr, WARNING))
> + {
> + wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION;
> + wait_for_standby = true;
> + }
> + else
> + {
> + /* Already caught up and doesn't need to wait for standby_slots. */
>   break;
> + }
> ...
> }
> 
> Can we try to move these checks into a separate function that returns
> a boolean and has an out parameter as wait_event?

Refactored.

> 
> 2. How about naming StandbyConfirmedFlush() as StandbySlotsAreCaughtup?

I used a similar version based on the suggested name: StandbySlotsHaveCaughtup.

Best Regards,
Hou zj



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Zhijie Hou (Fujitsu)"
Дата:
Сообщение: RE: Synchronizing slots from primary to standby
Следующее
От: Kyotaro Horiguchi
Дата:
Сообщение: Re: Infinite loop in XLogPageRead() on standby