Re: [HACKERS] Get stuck when dropping a subscription duringsynchronizing table

Поиск
Список
Период
Сортировка
От Masahiko Sawada
Тема Re: [HACKERS] Get stuck when dropping a subscription duringsynchronizing table
Дата
Msg-id CAD21AoCXTnTbHLrHPqshBM399ck+oGMg5PD-usK-5s+kQEv39g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] Get stuck when dropping a subscription duringsynchronizing table  (Masahiko Sawada <sawada.mshk@gmail.com>)
Список pgsql-hackers
On Wed, Jun 28, 2017 at 2:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Jun 28, 2017 at 1:47 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>>
>> On 27/06/17 10:51, Masahiko Sawada wrote:
>>> On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>
>>> I've reviewed this patch briefly.
>>
>> Thanks!
>>
>>>
>>> @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid)
>>>  }
>>>
>>>  /*
>>> + * Request worker to be stopped on commit.
>>> + */
>>> +void
>>> +logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
>>> +{
>>> +       LogicalRepWorkerId *wid;
>>> +       MemoryContext old;
>>> +
>>> +       old = MemoryContextSwitchTo(TopTransactionContext);
>>> +
>>> +       wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId));
>>> +
>>> +       /*
>>> +       wid = MemoryContextAlloc(TopTransactionContext,
>>> +
>>> sizeof(LogicalRepWorkerId));
>>> +       */
>>> +       wid->subid = subid;
>>> +       wid->relid = relid;
>>> +
>>> +       on_commit_stop_workers = lappend(on_commit_stop_workers, wid);
>>> +
>>> +       MemoryContextSwitchTo(old);
>>> +}
>>>
>>> logicalrep_worker_stop_at_commit() has a problem that new_list()
>>> called by lappend() allocates the memory from current memory context.
>>> It should switch the memory context and then allocate the memory for
>>> wid and append it to the list.
>>>
>
> Thank you for updating the patch!
>
>>
>> Right, fixed (I see you did that locally as well based on the above
>> excerpt ;) ).
>
> Oops, yeah that's my test code.
>
>>> --------
>>> @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void)
>>>  void
>>>  AtEOXact_ApplyLauncher(bool isCommit)
>>>  {
>>> -       if (isCommit && on_commit_launcher_wakeup)
>>> -               ApplyLauncherWakeup();
>>> +       ListCell *lc;
>>> +
>>> +       if (isCommit)
>>> +       {
>>> +               foreach (lc, on_commit_stop_workers)
>>> +               {
>>> +                       LogicalRepWorkerId *wid = lfirst(lc);
>>> +                       logicalrep_worker_stop(wid->subid, wid->relid);
>>> +               }
>>> +
>>> +               if (on_commit_launcher_wakeup)
>>> +                       ApplyLauncherWakeup();
>>>
>>> Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't
>>> support the prepared transaction. Since we allocate the list
>>> on_commit_stop_workers in TopTransactionContext the postgres crashes
>>> if we execute any query after prepared transaction that removes
>>> subscription relation state. Also after fixed this issue, we still
>>> need to something: the list of on_commit_stop_workers is not
>>> associated the prepared transaction.  A query next to "preapre
>>> transaction" tries to stop workers at the commit. There was similar
>>> discussion before.
>>>
>>
>> Hmm, good point. I think for now it makes sense to simply don't allow
>> PREPARE for transactions that manipulate workers similar to what we do
>> when there are exported snapshots. Done it that way in attached.
>
> Agreed. Should we note that in docs?
>
>>
>>> --------
>>> +
>>> +                               ensure_transaction();
>>> +
>>> UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
>>> +
>>>             rstate->relid, rstate->state,
>>> +
>>>             rstate->lsn);
>>>
>>>
>>> Should we commit the transaction if we started a new transaction
>>> before update the subscription relation state, or it could be deadlock
>>> risk.
>>
>> We only lock the whole subscription (and only conflicting things are
>> DROP and ALTER SUBSCRIPTION), not individual subscription-relation
>> mapping so it doesn't seem to me like there is any higher risk of
>> deadlocks than anything else which works with multiple tables (or
>> compared to previous behavior).
>>
>
> I'm concerned that a lock for whole subscription could be conflicted
> between ALTER SUBSCRIPTION, table sync worker and apply worker:
>
> Please imagine the following case.
>
> 1. The apply worker updates a subscription relation state R1 of
>     subscription S1.
>     -> It acquires AccessShareLock on S1, and keep holding.
> 2. ALTER SUBSCRIPTION SET PUBLICATION tries to acquire
>     AccessExclusiveLock on S1.
>     -> But it waits for the apply worker to release the lock.
> 3. The apply worker calls wait_for_relation_state_change(relid,
>     SUBREL_STATE_SYNCDONE) and waits for the table sync worker
>     for R2 to change its status.
>     -> Note that the apply worker is still holding AccessShareLock on S1
> 4. The table sync worker tries to update its status to SYNCDONE
>     -> In UpdateSubscriptionRelState(), it tires to acquire AccessShareLock
>        on S1 but waits for it. *deadlock*
>
> To summary, because the apply worker keeps holding AccessShareLock on
> S1, the acquiring AccessExclusiveLock by ALTER SUBSCRIPTION waits for
> the apply worker, and then the table sync worker also waits for the
> ALTER SUBSCRIPTION in order to change its status. And the apply worker
> waits for the table sync worker to change its status.
>
> I encountered the similar case once on my environment. But since it
> completely depends on timing I don't have the concrete reproducing
> steps.
>

Also, now the patch conflicts with current HEAD, so need to be updated.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl
Следующее
От: Amit Langote
Дата:
Сообщение: Re: [HACKERS] Multi column range partition table