Re: Synchronizing slots from primary to standby

From: shveta malik
Subject: Re: Synchronizing slots from primary to standby
Date:
Msg-id: CAJpy0uCG7=OwVZ03xd0855mdxFFP-ahwjfV+Cvrisgt2JyNb3w@mail.gmail.com
In reply to: Re: Synchronizing slots from primary to standby  (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Synchronizing slots from primary to standby  ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>)
List: pgsql-hackers
On Thu, Nov 9, 2023 at 8:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 9, 2023 at 8:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> > >
> > > > Unrelated to above, if there is a user slot on standby with the same
> > > > name which the slot-sync worker is trying to create, then shall it
> > > > emit a warning and skip the sync of that slot or shall it throw an
> > > > error?
> > > >
> > >
> > > I'd vote for emit a warning and move on to the next slot if any.
> > >
> >
> > But then it could take time for users to know the actual problem and
> > they probably notice it after failover. OTOH, if we throw an error
> > then probably they will come to know earlier because the slot sync
> > mechanism would be stopped. Do you have reasons to prefer giving a
> > WARNING and skipping creating such slots? I expect this WARNING to
> > keep getting repeated in LOGs because the consecutive sync tries will
> > again generate a WARNING.
> >
>
> Apart from the above, I would like to discuss the slot sync work
> distribution strategy of this patch. The current implementation as
> explained in the commit message [1] works well if the slots belong to
> multiple databases. It is clear from the data in emails [2][3][4] that
> having more workers really helps if the slots belong to multiple
> databases. But I think if all the slots belong to one or very few
> databases then such a strategy won't be as good. Now, on one hand, we
> get very good numbers for a particular workload with the strategy used
> in the patch but OTOH it may not be adaptable to various different
> kinds of workloads. So, I have a question whether we should try to
> optimize this strategy for various kinds of workloads or for the first
> version let's use a single-slot sync-worker and then we can enhance
> the functionality in later patches either in PG17 itself or in PG18 or
> later versions.

I can work on separating the patch. We can first focus on the
single-worker design and then work on the multi-worker design, either
immediately (if needed) or in the second draft of the patch. I would
like to know the thoughts of others on this.

> One thing to note is that a lot of the complexity of
> the patch is attributed to the multi-worker strategy which may still
> not be efficient, so there is an argument to go with a simpler
> single-slot sync-worker strategy and then enhance it in future
> versions as we learn more about various workloads. It will also help
> to develop this feature incrementally instead of doing all the things
> in one go and taking a much longer time than it should.
>

Agreed. With multi-workers, a lot of complexity (dsa, locks, etc.) has
come into play. We can decide better on our workload distribution
strategy among workers once we have more clarity on different types of
workloads.
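
To make the dsa bookkeeping concrete: the commit message quoted at [1]
says each worker's dbid list starts with room for 100 dbids and is
reallocated 100 entries larger whenever it fills up. The following is
only a rough Python model of that growth rule, not the patch's C code.
The names DbidList and DBIDS_CHUNK are made up for illustration, and a
plain list stands in for the dsa allocation.

```python
DBIDS_CHUNK = 100  # initial capacity and growth step, per the commit message


class DbidList:
    """Hypothetical model of one worker's dbid array (not the patch's code)."""

    def __init__(self):
        self.capacity = DBIDS_CHUNK
        self.slots = [None] * self.capacity  # stands in for the dsa allocation
        self.count = 0

    def append(self, dbid):
        if self.count == self.capacity:
            # Array is full: allocate a block 100 entries larger and copy
            # the existing dbids over.
            self.capacity += DBIDS_CHUNK
            new_slots = [None] * self.capacity
            new_slots[:self.count] = self.slots
            self.slots = new_slots
        self.slots[self.count] = dbid
        self.count += 1
```

In the real patch this reallocation happens in shared memory that both
the launcher and the workers access, which is where the locking
mentioned above comes in; the model ignores concurrency entirely.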

>
> [1] - "The replication launcher on the physical standby queries
> primary to get the list of dbids for failover logical slots. Once it
> gets the dbids, if dbids < max_slotsync_workers, it starts only that
> many workers, and if dbids > max_slotsync_workers, it starts
> max_slotsync_workers and divides the work equally among them. Each
> worker is then responsible to keep on syncing the logical slots
> belonging to the DBs assigned to it.
>
> Each slot-sync worker will have its own dbids list. Since the upper
> limit of this dbid-count is not known, it needs to be handled using
> dsa. We initially allocated memory to hold 100 dbids for each worker.
> If this limit is exhausted, we reallocate this memory with size
> incremented again by 100."
>
> [2] - https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com
> [3] - https://www.postgresql.org/message-id/CAFPTHDZw2G3Pax0smymMjfPqdPcZhMWo36f9F%2BTwNTs0HFxK%2Bw%40mail.gmail.com
> [4] - https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
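
For readers following the thread, the distribution rule quoted at [1]
can be sketched as below. This is a simplified Python model and not the
patch's implementation: assign_dbids is a hypothetical name, and
round-robin is just one assumed way to "divide the work equally"; the
commit message does not say how the split is done. Only
max_slotsync_workers comes from the quoted text.

```python
def assign_dbids(dbids, max_slotsync_workers):
    """Return one dbid list per worker to start.

    Starts min(number of distinct dbids, max_slotsync_workers) workers.
    Round-robin assignment is an assumption made for this sketch.
    """
    unique = sorted(set(dbids))
    n_workers = min(len(unique), max_slotsync_workers)
    if n_workers == 0:
        return []
    # Slicing with a stride keeps per-worker counts within one of each other.
    return [unique[i::n_workers] for i in range(n_workers)]


# Slots spread over three databases, two workers allowed.
print(assign_dbids([5, 16384, 16385], 2))   # [[5, 16385], [16384]]

# All slots in a single database: only one worker is started,
# however many slots that database holds.
print(assign_dbids([16384], 4))             # [[16384]]
```

The second call shows the case raised above: because work is divided by
database, extra workers do not help when all slots belong to one or
very few databases.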


