Re: Deadlock between logrep apply worker and tablesync worker

Поиск
Список
Период
Сортировка
От Peter Smith
Тема Re: Deadlock between logrep apply worker and tablesync worker
Дата
Msg-id CAHut+PuQSQjL_Wd_-8GncNjzX_qMiCg_hP2OTZ1gY4GtQ3tNqQ@mail.gmail.com
обсуждение исходный текст
Ответ на RE: Deadlock between logrep apply worker and tablesync worker  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
Список pgsql-hackers
On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, February 2, 2023 7:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Feb 2, 2023 at 12:05 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com>
> > wrote:
> > > > On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > >
> > > I also tried to test the time of "src/test/subscription/t/002_types.pl"
> > > before and after the patch(change the lock level) and Tom's
> > > patch(split
> > > transaction) like what Vignesh has shared on -hackers.
> > >
> > > I run about 100 times for each case. Tom's and the lock level patch
> > > behave similarly on my machines[1].
> > >
> > > HEAD: 3426 ~ 6425 ms
> > > HEAD + Tom: 3404 ~ 3462 ms
> > > HEAD + Vignesh: 3419 ~ 3474 ms
> > > HEAD + Tom + Vignesh: 3408 ~ 3454 ms
> > >
> > > Even apart from the testing time reduction, reducing the lock level
> > > and lock the specific object can also help improve the lock contention
> > > which user(that use the exposed function) , table sync worker and
> > > apply worker can also benefit from it. So, I think pushing the patch to change
> > the lock level makes sense.
> > >
> > > And the patch looks good to me.
> > >
> >
> > Thanks for the tests. I also see a reduction in test time variability with Vignesh's
> > patch. I think we can release the locks in case the origin is concurrently
> > dropped as in the attached patch. I am planning to commit this patch
> > tomorrow unless there are more comments or objections.
> >
> > > While on it, after pushing the patch, I think there is another case
> > > might also worth to be improved, that is the table sync and apply
> > > worker try to drop the same origin which might cause some delay. This
> > > is another case(different from the deadlock), so I feel we can try to improve
> > this in another patch.
> > >
> >
> > Right, I think that case could be addressed by Tom's patch to some extent but
> > I am thinking we should also try to analyze if we can completely avoid the need
> > to remove origins from both processes. One idea could be to introduce
> > another relstate something like PRE_SYNCDONE and set it in a separate
> > transaction before we set the state as SYNCDONE and remove the slot and
> > origin in tablesync worker.
> > Now, if the tablesync worker errors out due to some reason during the second
> > transaction, it can remove the slot and origin after restart by checking the state.
> > However, it would add another relstate which may not be the best way to
> > address this problem. Anyway, that can be accomplished as a separate patch.
>
> Here is an attempt to achieve the same.
> Basically, the patch removes the code that drop the origin in apply worker. And
> add a new state PRE_SYNCDONE after synchronization finished in front of apply
> (sublsn set), but before dropping the origin and other final cleanups. The
> tablesync will restart and redo the cleanup if it failed after reaching the new
> state. Besides, since the changes can already be applied on the table in
> PRE_SYNCDONE state, so I also modified the check in
> should_apply_changes_for_rel(). And some other conditions for the origin drop
> in subscription commands are were adjusted in this patch.
>


BTW, the tablesync.c has a large file header comment which describes
all about the relstates including some examples. So this patch needs
to include modifications to that comment.

------
Kind Regards,
Peter Smith.
Fujitsu Australia.



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Nathan Bossart
Дата:
Сообщение: Re: recovery modules
Следующее
От: Andres Freund
Дата:
Сообщение: Re: Logical replication timeout problem