Re: Single transaction in the tablesync worker?

Поиск
Список
Период
Сортировка
От Peter Smith
Тема Re: Single transaction in the tablesync worker?
Дата
Msg-id CAHut+PuT=vp9uDTnpRNLYt+LhrB5XFcKroVSgfcBfW7hj_e3_w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Single transaction in the tablesync worker?  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: Single transaction in the tablesync worker?  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers
On Mon, Dec 21, 2020 at 11:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 21, 2020 at 3:17 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > Few other comments:
> > > ==================
> >
> > Thanks for your feedback.
> >
> > > 1.
> > > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > > because I don't think
> > > + * such workers are even iterated by this loop, and nobody else is
> > > removing them.
> > > + */
> > > + if (slotname)
> > > + {
> > >
> > > The above FIXME is not clear to me. Actually, the crashed workers
> > > should restart, finish their work, and drop the slots. So not sure
> > > what exactly this FIXME refers to?
> >
> > Yes, normally if the tablesync can complete it should behave like that.
> > But I think there are other scenarios where it may be unable to
> > clean-up after itself. For example:
> >
> > i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
> > handled by tablesync can give a PK violation which also will crash
> > again and again for each re-launched/replacement tablesync worker.
> > This can be reproduced in the debugger. If the DropSubscription
> > doesn't clean-up the tablesync's slot then nobody will.
> >
> > ii) Also DROP SUBSCRIPTION code has locking (see code commit) "to
> > ensure that the launcher doesn't restart new worker during dropping
> > the subscription".
> >
>
> Yeah, I have also read that comment but do you know how it is
> preventing relaunch? How does the subscription lock help?

Hmmm. I did see there is a matching lock in get_subscription_list of
launcher.c, which may be what that code comment was referring to. But
I also am currently unsure how this lock prevents anybody (e.g.
process_syncing_tables_for_apply) from executing another
logicalrep_worker_launch.

>
> > So executing DROP SUBSCRIPTION will prevent a newly
> > crashed tablesync from re-launching, so it won’t be able to take care
> > of its own slot. If the DropSubscription doesn't clean-up that
> > tablesync's slot then nobody will.
> >
>
>
> > >
> > > 2.
> > > DropSubscription()
> > > {
> > > ..
> > > ReplicationSlotDropAtPubNode(
> > > + NULL,
> > > + conninfo, /* use conninfo to make a new connection. */
> > > + subname,
> > > + syncslotname);
> > > ..
> > > }
> > >
> > > With the above call, it will form a connection with the publisher and
> > > drop the required slots. I think we need to save the connection info
> > > so that we don't need to connect/disconnect for each slot to be
> > > dropped. Later in this function, we again connect and drop the apply
> > > worker slot. I think we should connect just once drop the apply and
> > > table sync slots if any.
> >
> > OK. IIUC this is a suggestion for more efficient connection usage,
> > rather than actual bug right?
> >
>
> Yes, it is for effective connection usage.
>

I have addressed this in the latest patch [v6]

> >
> > >
> > > 3.
> > > ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char
> > > *conninfo, char *subname, char *slotname)
> > > {
> > > ..
> > > + PG_TRY();
> > > ..
> > > + PG_CATCH();
> > > + {
> > > + /* NOP. Just gobble any ERROR. */
> > > + }
> > > + PG_END_TRY();
> > >
> > > Why are we suppressing the error instead of handling it the error in
> > > the same way as we do while dropping the apply worker slot in
> > > DropSubscription?
> >
> > This function is common - it is also called from the tablesync
> > finish_sync_worker. But in the finish_sync_worker case I wanted to
> > avoid throwing an ERROR which would cause the tablesync to crash and
> > relaunch (and crash/relaunch/repeat...) when all it was trying to do
> > in the first place was just cleanup and exit the process. Perhaps the
> > error suppression should be conditional depending where this function
> > is called from?
> >
>
> Yeah, that could be one way and if you follow my previous suggestion
> this function might change a bit more.

I have addressed this in the latest patch [v6]

---
[v6] https://www.postgresql.org/message-id/CAHut%2BPuCLty2HGNT6neyOcUmBNxOLo%3DybQ2Yv-nTR4kFY-8QLw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Daniil Zakhlystov
Дата:
Сообщение: Re: libpq compression
Следующее
От: Fujii Masao
Дата:
Сообщение: Re: Deadlock between backend and recovery may not be detected