Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Поиск
Список
Период
Сортировка
От Peter Smith
Тема Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Дата
Msg-id CAHut+PuCjbJkEqNc1GxK0CSA0w_tDR4-b9vdbUGPYFB8F9fPKg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-bugs
On Wed, Nov 25, 2020 at 7:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I thought there should be a couple small
> > changes made for this patch, as follows.
> >
> > (1) process_sync_tables should be last
> >
> > IMO the process_syncing_tables should be always the *last* thing any
> > apply handler does, because it seems a bad idea for the worker to
> > change state (e.g. to say it is SUBREL_STATE_SYNCDONE) if in fact
> > there is still a little bit more processing remaining.
> >
>
> Hmm, what makes you think it is a bad idea, is there any comment which
> makes you feel so? As far as I understand, the only thing tablesync
> worker needs to ensure before reaching that state is it should apply
> till the required lsn which is done by the time it is called in the
> patch. I think doing it sooner is better because it will let apply
> worker start its work. In any case, this is not related to this
> bug-fix patch, so, if you want to build a case for doing it later then
> we can discuss it separately.

Yes, this review issue is independent of the initial streaming bug the
patch is solving. But by refactoring this exact line of code into a
new function apply_handle_commit_internal() I felt the patch is kind
of now taking ownership of it - so it is not *really* unrelated to
this patch anymore...

(However if you prefer to treat this review item as a separate issue
it is fine by me).

AFAIK the process_syncing_tables function acts like the main handshake
mechanism between the "tablesync" and "apply" workers.
e.g. From "tablesync" worker POV - am I finished? Should I tell the
apply worker I am DONE? Is it time to exit?
e.g. From "apply" worker POV - have the tablesyncs caught up yet? Can I proceed?

I do think these handshake functions ought to be consistently located
(e.g always kept last in the apply handler functions where possible).

It is true, a slightly different placement (e.g. like current code)
still passes the tests, but I see more problems than benefits to keep
it that way. IIUC for the tablesync worker to be executing some of
these handlers at all is a quite a rare occurrence caused by unusual
timing. I think putting the process_syncing_tables() call earlier with
the objective to allow apply worker to proceed a few ms earlier (for
the already rare case) is not worth it.

OTOH, putting the handshake function consistently last does have benefits:
- the consistent code is easier to read
- none of the tablesync timing stuff is very easy to test (certainly
not with automated regression tests). Keeping code consistent means
less risk of introducing some unforeseen/untestable issue
- that stream_cleanup_files(...) called by
apply_handle_stream_commit() is supposed to be cleaning up file
resources. IIUC by allowing the tablesync worker to call
process_syncing_tables() before this cleanup it means the tablesync
worker may decide that it is SUBREL_STATE_SYNCDONE and then just exit
the process! So in this scenario that file resource cleanup will never
get a chance to happen at all. Isn't that just a plain bug?

>
> >
> > (2) misleading comment
> >
> > /*
> >  * Start a transaction on stream start, this transaction will be committed
> >  * on the stream stop. We need the transaction for handling the buffile,
> >  * used for serializing the streaming data and subxact info.
> >  */
> >
> > That above comment (in the apply_handle_stream_start function) may now
> > be inaccurate/misleading for the case of tablesync worker, because I
> > think for tablesync you are explicitly avoiding the tx commit within
> > the apply_handle_stream_stop().
> >
>
> Okay, I think we can slightly adjust the comment as: "Start a
> transaction on stream start, this transaction will be committed on the
> stream stop unless it is a tablesync worker in which case it will be
> committed after processing all the messages. We need the transaction
> for handling the buffile, used for serializing the streaming data and
> subxact info.".

The reworded comment looks ok now.

---

Kind Regards,
Peter Smith.
Fujitsu Australia



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Christian
Дата:
Сообщение: Re: BUG #16745: delete does not prune partitions on declarative partitioned table
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop