Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

From Dilip Kumar
Subject Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date
Msg-id CAFiTN-sn5odfWKAB2UM14NbtWx_bn6RXSJpeMXaezc+ANf0Png@mail.gmail.com
In response to Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-bugs
On Sat, Nov 7, 2020 at 9:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Nov 7, 2020 at 5:31 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2020-Nov-05, Amit Kapila wrote:
> >
> > > On Wed, Nov 4, 2020 at 7:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > > >
> > > > On 2020-Nov-04, Amit Kapila wrote:
> > > >
> > > > > On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > > >
> > > > > > * STREAM COMMIT bug?
> > > > > >   In apply_handle_stream_commit, we do CommitTransactionCommand, but
> > > > > >   apparently in a tablesync worker we shouldn't do it.
> > > > >
> > > > > In the tablesync stage, we don't allow streaming. See pgoutput_startup
> > > > > where we disable streaming for the init phase. As far as I understand,
> > > > > for tablesync we create the initial slot during which streaming will
> > > > > be disabled then we will copy the table (here logical decoding won't
> > > > > be used) and then allow the apply worker to get any other data which
> > > > > is inserted in the meantime. Now, I might be missing something here
> > > > > but if you can explain it a bit more or share some test to show how we
> > > > > can reach here via tablesync worker then we can discuss the possible
> > > > > solution.
> > > >
> > > > Hmm, okay, that sounds like there would be no bug then.  Maybe what we
> > > > need is just an assert in apply_handle_stream_commit that
> > > > !am_tablesync_worker(), as in the attached patch.  Passes tests.
> > > >
> > >
> > > +1. But do we want to have this Assert only in stream_commit API or
> > > all stream APIs as well?
> >
> > Well, the only reason I care about this is that apply_handle_commit
> > contains a comment that we must not do CommitTransactionCommand in the
> > syncworker case; so if you look at apply_handle_stream_commit and note
> > that it doesn't concern it about that, you become concerned that it
> > might be broken.  I don't think the other routines handling the "stream"
> > thing have that issue.
> >
>
> Fair enough, as mentioned in my previous email, I think we need to
> confirm once that after copy how the decoding happens on upstream for
> transactions during the phase where tablesync workers is moving to
> state SUBREL_STATE_SYNCDONE from SUBREL_STATE_CATCHUP. I'll try to
> come up (in next few days) with some test case to debug and test this
> particular scenario and share my findings.

IIUC, the table sync worker does the initial copy using a consistent
snapshot.  After that, if the main apply worker is behind it, the
table sync worker waits for the apply worker to reach the table sync
worker's start point, and from there the apply worker can continue
applying the changes.  OTOH, if the apply worker has already moved
ahead in processing the WAL after launching the table sync worker,
then the apply worker has skipped that many transactions for this
table, because the table was not yet in the SYNC DONE state.  So the
table sync worker needs to cover this gap by applying the WAL using
the normal apply path, and it can be moved to the SYNC DONE state once
it catches up with the apply worker.  After putting the table sync
worker in the catchup state, the apply worker waits for the table sync
worker to catch up.
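
To make the two cases concrete, here is a rough toy model of my
understanding.  It is only a sketch, not the actual tablesync.c code:
ToyLsn, ToySyncAction, toy_after_copy and toy_catch_up are made-up
names for illustration, and LSNs are modelled as plain integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN; not the PostgreSQL definition. */
typedef uint64_t ToyLsn;

/* What the table sync worker has to do once the initial copy is done. */
typedef enum ToySyncAction
{
	TOY_WAIT_FOR_APPLY,		/* apply worker is behind the copy's start point */
	TOY_MARK_SYNCDONE,		/* both are at the same position, no gap */
	TOY_REPLAY_GAP			/* apply worker is ahead, skipped changes exist */
} ToySyncAction;

/* Compare the copy's start point with the apply worker's position. */
static ToySyncAction
toy_after_copy(ToyLsn copy_start, ToyLsn apply_pos)
{
	if (apply_pos < copy_start)
		return TOY_WAIT_FOR_APPLY;
	if (apply_pos == copy_start)
		return TOY_MARK_SYNCDONE;
	return TOY_REPLAY_GAP;
}

/*
 * Count the changes the table sync worker must apply itself: those
 * after the copy's start point, up to and including the position the
 * apply worker had already reached (and therefore skipped for this table).
 */
static int
toy_catch_up(ToyLsn copy_start, ToyLsn apply_pos,
			 const ToyLsn *change_lsns, int nchanges)
{
	int			applied = 0;

	assert(nchanges >= 0);
	for (int i = 0; i < nchanges; i++)
	{
		if (change_lsns[i] > copy_start && change_lsns[i] <= apply_pos)
			applied++;
	}
	return applied;
}
```

For example, with a copy start point of 100 and the apply worker at
120, changes at 110 and 120 fall into the gap that the table sync
worker has to replay, while a change at 150 is left to the apply
worker after the table reaches the SYNC DONE state.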

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


