Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Поиск
Список
Период
Сортировка
От Dilip Kumar
Тема Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Дата
Msg-id CAFiTN-udJ-hMy5mBTOx3n7ejAs2cVsnfjboy1QmRgoZRf=KbNQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Dilip Kumar <dilipbalaut@gmail.com>)
Ответы Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Ajin Cherian <itsajin@gmail.com>)
Список pgsql-hackers
On Mon, Jul 13, 2020 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > >
> > > > > > 9.
> > > > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > > > > {
> > > > > > ..
> > > > > > + ReorderBufferToastReset(rb, txn);
> > > > > > + if (specinsert != NULL)
> > > > > > + ReorderBufferReturnChange(rb, specinsert);
> > > > > > ..
> > > > > > }
> > > > > >
> > > > > > Why do we need to do these here when we wouldn't have been done for
> > > > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> > > > >
> > > > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > > > > gracefully and we are continuing with further decoding so we need to
> > > > > return this change back.
> > > > >
> > > >
> > > > Okay, then I suggest we should do these before calling stream_stop and
> > > > also move ReorderBufferResetTXN after calling stream_stop  to follow a
> > > > pattern similar to try block unless there is a reason for not doing
> > > > so.  Also, it would be good if we can initialize specinsert with NULL
> > > > after returning the change as we are doing at other places.
> > >
> > > Okay
> > >
> > > > > > 10.  I have got the below failure once.  I have not investigated this
> > > > > > in detail as the patch is still under progress.  See, if you have any
> > > > > > idea?
> > > > > > #   Failed test 'check extra columns contain local defaults'
> > > > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > > > #          got: '2|0'
> > > > > > #     expected: '1000|500'
> > > > > > # Looks like you failed 1 test of 2.
> > > > > > make[2]: *** [check] Error 1
> > > > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > > > make[1]: *** Waiting for unfinished jobs....
> > > > > > make: *** [check-world-src/test-recurse] Error 2
> > > > >
> > > > > Even I got the failure once and after that, it did not reproduce.  I
> > > > > have executed it multiple time but it did not reproduce again.  Are
> > > > > you able to reproduce it consistently?
> > > > >
> > > >
> > > > No, I am also not able to reproduce it consistently but I think this
> > > > can fail if a subscriber sends the replay_location before actually
> > > > replaying the changes.  First, I thought that extra send_feedback we
> > > > have in apply_handle_stream_commit might have caused this but I guess
> > > > that can't happen because we need the commit time location for that
> > > > and we are storing the same at the end of apply_handle_stream_commit
> > > > after applying all messages.  I am not sure what is going on here.  I
> > > > think we somehow need to reproduce this or some variant of this test
> > > > consistently to find the root cause.
> > >
> > > And I think it appeared first time for me,  so maybe either induced
> > > from past few versions so some changes in the last few versions might
> > > have exposed it.  I have noticed that almost 50% of the time I am able
> > > to reproduce after the clean build so I can trace back from which
> > > version it started appearing that way it will be easy to narrow down.
> >
> > I think the reason for the failure is that we are not setting
> > remote_final_lsn, in the streaming mode.  I have put multiple logs and
> > executed in log and from logs it appeared that some of the logical wal
> > did not get replayed due to below check in
> > should_apply_changes_for_rel.
> > return (rel->state == SUBREL_STATE_READY || (rel->state ==
> > SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));
> >
> > I still need to do the detailed analysis that why does this fail in
> > some cases,  basically, most of the time the rel->state is
> > SUBREL_STATE_READY so this check passes but whenever the state is
> > SUBREL_STATE_SYNCDONE it failed because we never update
> > remote_final_lsn.  I will try to set this value in
> > apply_handle_stream_commit and see whether it ever fails or not.
>
> I have verified that after setting the remote_final_lsn in the
> apply_handle_stream_commit, I don't see that regression failure in
> over 70 runs whereas without that change it failed 6 times in 50 runs.
> Apart from this, I have noticed one more thing related to the same
> point.  Basically, in the apply_handle_commit, we are calling
> process_syncing_tables whereas we are not calling the same in
 > apply_handle_stream_commit.

I have set the remote_final_lsn as well as called
process_syncing_tables, in apply_handle_stream_commit.  Please see the
latest patch set v33.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Dilip Kumar
Дата:
Сообщение: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Следующее
От: vignesh C
Дата:
Сообщение: Re: [PATCH] Performance Improvement For Copy From Binary Files