Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
От | Peter Smith |
---|---|
Тема | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop |
Дата | |
Msg-id | CAHut+PuCjbJkEqNc1GxK0CSA0w_tDR4-b9vdbUGPYFB8F9fPKg@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop (Amit Kapila <amit.kapila16@gmail.com>) |
Ответы |
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
(Amit Kapila <amit.kapila16@gmail.com>)
|
Список | pgsql-bugs |
On Wed, Nov 25, 2020 at 7:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I thought there should be a couple small > > changes made for this patch, as follows. > > > > (1) process_sync_tables should be last > > > > IMO the process_syncing_tables should be always the *last* thing any > > apply handler does, because it seems a bad idea for the worker to > > change state (e.g. to say it is SUBREL_STATE_SYNCDONE) if in fact > > there is still a little bit more processing remaining. > > > > Hmm, what makes you think it is a bad idea, is there any comment which > makes you feel so? As far as I understand, the only thing tablesync > worker needs to ensure before reaching that state is it should apply > till the required lsn which is done by the time it is called in the > patch. I think doing it sooner is better because it will let apply > worker start its work. In any case, this is not related to this > bug-fix patch, so, if you want to build a case for doing it later then > we can discuss it separately. Yes, this review issue is independent of the initial streaming bug the patch is solving. But by refactoring this exact line of code into a new function apply_handle_commit_internal() I felt the patch is kind of now taking ownership of it - so it is not *really* unrelated to this patch anymore... (However if you prefer to treat this review item as a separate issue it is fine by me). AFAIK the process_syncing_tables function acts like the main handshake mechanism between the "tablesync" and "apply" workers. e.g. From "tablesync" worker POV - am I finished? Should I tell the apply worker I am DONE? Is it time to exit? e.g. From "apply" worker POV - have the tablesyncs caught up yet? Can I proceed? I do think these handshake functions ought to be consistently located (e.g always kept last in the apply handler functions where possible). It is true, a slightly different placement (e.g. like current code) still passes the tests, but I see more problems than benefits to keep it that way. IIUC for the tablesync worker to be executing some of these handlers at all is a quite a rare occurrence caused by unusual timing. I think putting the process_syncing_tables() call earlier with the objective to allow apply worker to proceed a few ms earlier (for the already rare case) is not worth it. OTOH, putting the handshake function consistently last does have benefits: - the consistent code is easier to read - none of the tablesync timing stuff is very easy to test (certainly not with automated regression tests). Keeping code consistent means less risk of introducing some unforeseen/untestable issue - that stream_cleanup_files(...) called by apply_handle_stream_commit() is supposed to be cleaning up file resources. IIUC by allowing the tablesync worker to call process_syncing_tables() before this cleanup it means the tablesync worker may decide that it is SUBREL_STATE_SYNCDONE and then just exit the process! So in this scenario that file resource cleanup will never get a chance to happen at all. Isn't that just a plain bug? > > > > > (2) misleading comment > > > > /* > > * Start a transaction on stream start, this transaction will be committed > > * on the stream stop. We need the transaction for handling the buffile, > > * used for serializing the streaming data and subxact info. > > */ > > > > That above comment (in the apply_handle_stream_start function) may now > > be inaccurate/misleading for the case of tablesync worker, because I > > think for tablesync you are explicitly avoiding the tx commit within > > the apply_handle_stream_stop(). > > > > Okay, I think we can slightly adjust the comment as: "Start a > transaction on stream start, this transaction will be committed on the > stream stop unless it is a tablesync worker in which case it will be > committed after processing all the messages. We need the transaction > for handling the buffile, used for serializing the streaming data and > subxact info.". The reworded comment looks ok now. --- Kind Regards, Peter Smith. Fujitsu Australia
В списке pgsql-bugs по дате отправления:
Предыдущее
От: ChristianДата:
Сообщение: Re: BUG #16745: delete does not prune partitions on declarative partitioned table
Следующее
От: Amit KapilaДата:
Сообщение: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop