Re: [HACKERS] Get stuck when dropping a subscription duringsynchronizing table

Поиск
Список
Период
Сортировка
От Masahiko Sawada
Тема Re: [HACKERS] Get stuck when dropping a subscription duringsynchronizing table
Дата
Msg-id CAD21AoDA=F0Sa7j3UC-aYyW+wLWGV+axgECP0b=ab-=+J8TgdQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] Get stuck when dropping a subscription duringsynchronizing table  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
Ответы Re: [HACKERS] Get stuck when dropping a subscription duringsynchronizing table  (Masahiko Sawada <sawada.mshk@gmail.com>)
Список pgsql-hackers
On Wed, Jun 28, 2017 at 1:47 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
>
> On 27/06/17 10:51, Masahiko Sawada wrote:
>> On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> I've reviewed this patch briefly.
>
> Thanks!
>
>>
>> @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid)
>>  }
>>
>>  /*
>> + * Request worker to be stopped on commit.
>> + */
>> +void
>> +logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
>> +{
>> +       LogicalRepWorkerId *wid;
>> +       MemoryContext old;
>> +
>> +       old = MemoryContextSwitchTo(TopTransactionContext);
>> +
>> +       wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId));
>> +
>> +       /*
>> +       wid = MemoryContextAlloc(TopTransactionContext,
>> +
>> sizeof(LogicalRepWorkerId));
>> +       */
>> +       wid->subid = subid;
>> +       wid->relid = relid;
>> +
>> +       on_commit_stop_workers = lappend(on_commit_stop_workers, wid);
>> +
>> +       MemoryContextSwitchTo(old);
>> +}
>>
>> logicalrep_worker_stop_at_commit() has a problem that new_list()
>> called by lappend() allocates the memory from current memory context.
>> It should switch the memory context and then allocate the memory for
>> wid and append it to the list.
>>

Thank you for updating the patch!

>
> Right, fixed (I see you did that locally as well based on the above
> excerpt ;) ).

Oops, yeah that's my test code.

>> --------
>> @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void)
>>  void
>>  AtEOXact_ApplyLauncher(bool isCommit)
>>  {
>> -       if (isCommit && on_commit_launcher_wakeup)
>> -               ApplyLauncherWakeup();
>> +       ListCell *lc;
>> +
>> +       if (isCommit)
>> +       {
>> +               foreach (lc, on_commit_stop_workers)
>> +               {
>> +                       LogicalRepWorkerId *wid = lfirst(lc);
>> +                       logicalrep_worker_stop(wid->subid, wid->relid);
>> +               }
>> +
>> +               if (on_commit_launcher_wakeup)
>> +                       ApplyLauncherWakeup();
>>
>> Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't
>> support the prepared transaction. Since we allocate the list
>> on_commit_stop_workers in TopTransactionContext the postgres crashes
>> if we execute any query after prepared transaction that removes
>> subscription relation state. Also after fixed this issue, we still
>> need to something: the list of on_commit_stop_workers is not
>> associated the prepared transaction.  A query next to "preapre
>> transaction" tries to stop workers at the commit. There was similar
>> discussion before.
>>
>
> Hmm, good point. I think for now it makes sense to simply don't allow
> PREPARE for transactions that manipulate workers similar to what we do
> when there are exported snapshots. Done it that way in attached.

Agreed. Should we note that in docs?

>
>> --------
>> +
>> +                               ensure_transaction();
>> +
>> UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
>> +
>>             rstate->relid, rstate->state,
>> +
>>             rstate->lsn);
>>
>>
>> Should we commit the transaction if we started a new transaction
>> before update the subscription relation state, or it could be deadlock
>> risk.
>
> We only lock the whole subscription (and only conflicting things are
> DROP and ALTER SUBSCRIPTION), not individual subscription-relation
> mapping so it doesn't seem to me like there is any higher risk of
> deadlocks than anything else which works with multiple tables (or
> compared to previous behavior).
>

I'm concerned that a lock for whole subscription could be conflicted
between ALTER SUBSCRIPTION, table sync worker and apply worker:

Please imagine the following case.

1. The apply worker updates a subscription relation state R1 of   subscription S1.   -> It acquires AccessShareLock on
S1,and keep holding.
 
2. ALTER SUBSCRIPTION SET PUBLICATION tries to acquire   AccessExclusiveLock on S1.   -> But it waits for the apply
workerto release the lock.
 
3. The apply worker calls wait_for_relation_state_change(relid,   SUBREL_STATE_SYNCDONE) and waits for the table sync
worker  for R2 to change its status.   -> Note that the apply worker is still holding AccessShareLock on S1
 
4. The table sync worker tries to update its status to SYNCDONE   -> In UpdateSubscriptionRelState(), it tires to
acquireAccessShareLock      on S1 but waits for it. *deadlock*
 

To summary, because the apply worker keeps holding AccessShareLock on
S1, the acquiring AccessExclusiveLock by ALTER SUBSCRIPTION waits for
the apply worker, and then the table sync worker also waits for the
ALTER SUBSCRIPTION in order to change its status. And the apply worker
waits for the table sync worker to change its status.

I encountered the similar case once on my environment. But since it
completely depends on timing I don't have the concrete reproducing
steps.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Noah Misch
Дата:
Сообщение: Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: [HACKERS] Server crash due to SIGBUS(Bus Error) when trying toaccess the memory created using dsm_create().