Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION

Поиск
Список
Период
Сортировка
От Fujii Masao
Тема Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION
Дата
Msg-id CAHGQGwGyXNY1LWXY21W1GWsSafLLJNzxNk1EJ8DqW9_-6oH6tw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
Ответы Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION  (Masahiko Sawada <sawada.mshk@gmail.com>)
Список pgsql-hackers
On Tue, Feb 7, 2017 at 2:13 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 06/02/17 17:33, Fujii Masao wrote:
>> On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com> wrote:
>>> On 03/02/17 19:38, Fujii Masao wrote:
>>>> On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>>> On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
>>>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>>>> At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com>
>>>>>>> On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
>>>>>>> <michael.paquier@gmail.com> wrote:
>>>>>>>> On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>>>>>> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
>>>>>>>>>> Then, the reason for the TRY-CATCH cluase is that I found that
>>>>>>>>>> some functions called from there can throw exceptions.
>>>>>>>>>
>>>>>>>>> Yes, but all LWLocks should be released by normal error recovery.
>>>>>>>>> It should not be necessary for this code to clean that up by hand.
>>>>>>>>> If it were necessary, there would be TRY-CATCH around every single
>>>>>>>>> LWLockAcquire in the backend, and we'd have an unreadable and
>>>>>>>>> unmaintainable system.  Please don't add a TRY-CATCH unless it's
>>>>>>>>> *necessary* -- and you haven't explained why this one is.
>>>>>>>
>>>>>>> Yes.
>>>>>>
>>>>>> Thank you for the suggestion. I minunderstood that.
>>>>>>
>>>>>>>> Putting hands into the code and at the problem, I can see that
>>>>>>>> dropping a subscription on a node makes it unresponsive in case of a
>>>>>>>> stop. And that's just because calls to LWLockRelease are missing as in
>>>>>>>> the patch attached. A try/catch problem should not be necessary.
>>>>>>>
>>>>>>> Thanks for the patch!
>>>>>>>
>>>>>>> With the patch, LogicalRepLauncherLock is released at the end of
>>>>>>> DropSubscription(). But ISTM that the lock should be released just after
>>>>>>> logicalrep_worker_stop() and there is no need to protect the removal of
>>>>>>> replication slot with the lock.
>>>>>>
>>>>>> That's true. logicalrep_worker_stop returns after confirmig that
>>>>>> worker->proc is cleard, so no false relaunch cannot be caused.
>>>>>> After all, logicalrep_worker_stop is surrounded by
>>>>>> LWLockAcquire/Relase pair. So it can be moved into the funciton
>>>>>> and make the lock secrion to be more narrower.
>>>>
>>>> If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
>>>> removed and the comment for logicalrep_worker_stop() should be updated.
>>>>
>>>> Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock
>>>> while holding LogicalRepLauncherLock. OTOH, with your approach,
>>>> logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
>>>> LogicalRepWorkerLock.
>>>>
>>>> Therefore I pushed the simple patch which adds LWLockRelease() just after
>>>> logicalrep_worker_stop().
>>>>
>>>> Another problem that I found while reading the code is that the launcher can
>>>> start up the worker with the subscription that DROP SUBSCRIPTION just removed.
>>>> That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
>>>> but the launcher can see it and start new worker until the transaction for
>>>> DROP has been committed.
>>>>
>>>
>>> That was the reason why DropSubscription didn't release the lock in the
>>> first place. It was supposed to be released at the end of the
>>> transaction though.
>>
>> OK, I understood why you used the lock in that way. But using LWLock
>> for that purpose is not valid.
>>
>
> Yeah, I just tried to avoid what we are doing now really hard :)
>
>>>> To fix this issue, I think that DROP SUBSCRIPTION should take
>>>> AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
>>>> so that the launcher cannot see the entry to be being removed.
>>>>
>>>
>>> The whole point of having LogicalRepLauncherLock is to avoid having to
>>> do this, so if we do this we could probably get rid of it.
>>
>> Yes, let's remove LogicalRepLauncherLock and lock pg_subscription
>> with AccessExclusive mode at the beginning of DROP SUBSCRIPTION.
>> Attached patch does this.
>>
>
> Okay, looks reasonable to me.

Thanks for the review!
But ISMT that I should suspend committing the patch until we fix the issue
that Sawada reported in other thread. That bugfix may change the related
code and design very much.
https://www.postgresql.org/message-id/CAD21AoD+VO93zZ4ZQtZQb-jZ_wMko3OgGdx1MXO4T+8q_zHDDA@mail.gmail.com

Regards,

-- 
Fujii Masao



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andreas Karlsson
Дата:
Сообщение: Re: [HACKERS] 'text' instead of 'unknown' in Postgres 10
Следующее
От: Joel Jacobson
Дата:
Сообщение: [HACKERS] Idea on how to simplify comparing two sets