Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
От | Peter Smith |
---|---|
Тема | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop |
Дата | |
Msg-id | CAHut+PvnbS+tHarZPVm5dxMsASEHmX2XL=QtGKYkn5F6A82JFQ@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop (Amit Kapila <amit.kapila16@gmail.com>) |
Ответы |
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
(Amit Kapila <amit.kapila16@gmail.com>)
|
Список | pgsql-bugs |
On Thu, Nov 26, 2020 at 1:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 26, 2020 at 6:47 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Wed, Nov 25, 2020 at 7:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I thought there should be a couple small > > > > changes made for this patch, as follows. > > > > > > > > (1) process_sync_tables should be last > > > > > > > > IMO the process_syncing_tables should be always the *last* thing any > > > > apply handler does, because it seems a bad idea for the worker to > > > > change state (e.g. to say it is SUBREL_STATE_SYNCDONE) if in fact > > > > there is still a little bit more processing remaining. > > > > > > > > > > Hmm, what makes you think it is a bad idea, is there any comment which > > > makes you feel so? As far as I understand, the only thing tablesync > > > worker needs to ensure before reaching that state is it should apply > > > till the required lsn which is done by the time it is called in the > > > patch. I think doing it sooner is better because it will let apply > > > worker start its work. In any case, this is not related to this > > > bug-fix patch, so, if you want to build a case for doing it later then > > > we can discuss it separately. > > > > Yes, this review issue is independent of the initial streaming bug the > > patch is solving. But by refactoring this exact line of code into a > > new function apply_handle_commit_internal() I felt the patch is kind > > of now taking ownership of it - so it is not *really* unrelated to > > this patch anymore... > > > > (However if you prefer to treat this review item as a separate issue > > it is fine by me). > > > > Let's try to see if we can have a common understanding before I push > this. I don't think this is an issue neither did you presented any > theory or test why you believe it is an issue. > Right, there is no "issue", I only felt current code is more confusing that it needs to be: - Since process_syncing_tables is called many times by apply handlers I prefered a common code pattern (e.g. always last) because I would find it easier to understand. YMMV. - And I also thought it would be neater to simply move the process_syncing_tables by one line so you can do the stream file resource cleanup explicitly the normal way instead of relying on implicit cleanup as the tablesync process exits. But I also understand you may choose to leave everything as-is and that is still OK too. > > AFAIK the process_syncing_tables function acts like the main handshake > > mechanism between the "tablesync" and "apply" workers. > > e.g. From "tablesync" worker POV - am I finished? Should I tell the > > apply worker I am DONE? Is it time to exit? > > e.g. From "apply" worker POV - have the tablesyncs caught up yet? Can I proceed? > > > > Here, I think we need to have a common understanding of finished and > or DONE. It is not defined by where that function is called but when > it is appropriate to call it. I have tried to explain in my previous > response but not sure you have read it carefully. > > > I do think these handshake functions ought to be consistently located > > (e.g always kept last in the apply handler functions where possible). > > > > There is no harm in doing so but you haven't presented any theory > which makes me feel that is correct or more appropriate. > > > It is true, a slightly different placement (e.g. like current code) > > still passes the tests, > > > > Hmm, it is not only about passing the tests. It is not always true > that if tests pass, the code is correct especially in such cases where > writing test is not feasible. We need to understand the design > principle behind doing so which I am trying to explain. > > > but I see more problems than benefits to keep > > it that way. IIUC for the tablesync worker to be executing some of > > these handlers at all is a quite a rare occurrence caused by unusual > > timing. > > It is due to the reason that we want tablesync worker to apply the WAL > up to LSN apply worker has read by that time so that the apply worker > can continue from there. I think it is not always/only "up to LSN apply worker has read". IIUC, my tests of deliberately sending messages while tablesync is paused in the debugger means the tablesync worker will get *ahead* of the apply worker. > > > I think putting the process_syncing_tables() call earlier with > > the objective to allow apply worker to proceed a few ms earlier (for > > the already rare case) is not worth it. > > > > I think that is not the primary objective. It is also that having it > in common function as the patch is doing allows us to not accidentally > remove it or forget calling it from some other similar place. OTOH, I > don't see any clear advantage of moving out of the common function. > > > OTOH, putting the handshake function consistently last does have benefits: > > - the consistent code is easier to read > > - none of the tablesync timing stuff is very easy to test (certainly > > not with automated regression tests). Keeping code consistent means > > less risk of introducing some unforeseen/untestable issue > > > > - that stream_cleanup_files(...) called by > > apply_handle_stream_commit() is supposed to be cleaning up file > > resources. IIUC by allowing the tablesync worker to call > > process_syncing_tables() before this cleanup it means the tablesync > > worker may decide that it is SUBREL_STATE_SYNCDONE and then just exit > > the process! So in this scenario that file resource cleanup will never > > get a chance to happen at all. Isn't that just a plain bug? > > > > No, we clean up the files on process exit (via > SharedFileSetDeleteOnProcExit). If we don't have such a mechism, it > would be a problem in more number of cases then you describe. See > comments atop src/backend/replication/logical/worker.c (especially the > STREAMED TRANSACTIONS section). Got it. Thanks. --- Kind Regards, Peter Smith. Fujitsu Australia
В списке pgsql-bugs по дате отправления:
Предыдущее
От: David RowleyДата:
Сообщение: Re: BUG #16745: delete does not prune partitions on declarative partitioned table
Следующее
От: Walter WeinmannДата:
Сообщение: Re: BUG #16736: SCRAM authentication is not supported by this driver