Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From: Tomas Vondra
Subject: Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
Date:
Msg-id: 8948e4a2-fec7-8a61-9ace-2537cee3ef6f@enterprisedb.com
In reply to: Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION  (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers

On 11/21/23 14:16, Amit Kapila wrote:
> On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I decided to do some stress-testing of the built-in logical replication,
>> as part of the sequence decoding work. And I soon ran into an undetected
>> deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-(
>>
>> The attached bash script triggers that in a couple of seconds for me.
>> The script looks complicated, but most of the code is just waiting for
>> the sync to complete, catchup, and that sort of thing.
>>
>> What the script does is pretty simple:
>>
>> 1) initialize two clusters, set them as publisher/subscriber pair
>>
>> 2) create some number of tables, add them to publication and wait for
>>    the sync to complete
>>
>> 3) start two pgbench runs in the background, modifying the publication
>>    (one removes+adds all tables in a single transaction, one does that
>>     with a transaction per table; see the sketch after this list)
>>
>> 4) run refresh.sh which does ALTER SUBSCRIPTION ... REFRESH PUBLICATION
>>    in a loop (now that I think about it, could be another pgbench
>>    script, but well ...)
>>
>> 5) some consistency checks, but the lockup happens earlier so this does
>>    not really matter
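>>
>> For illustration, the publication-churn scripts from step 3 boil down
>> to something like this (publication and table names are made up here,
>> the real script generates them):
>>
>>   BEGIN;
>>   ALTER PUBLICATION p DROP TABLE t1;
>>   ALTER PUBLICATION p ADD TABLE t1;
>>   ALTER PUBLICATION p DROP TABLE t2;
>>   ALTER PUBLICATION p ADD TABLE t2;
>>   COMMIT;
>>
>> with the second variant committing after each DROP/ADD pair instead.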
>>
>> After a small number of refresh cycles (for me it's usually a couple
>> of dozen), we end up with a couple of stuck locks (I shortened the
>> backend type string a bit, for formatting reasons):
>>
>>   test=# select a.pid, classid, objid, backend_type, query
>>            from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
>>           where not granted;
>>
>>      pid   | classid | objid | backend_type     |         query
>>   ---------+---------+-------+------------------+----------------------
>>    2691941 |    6100 | 16785 | client backend   | ALTER SUBSCRIPTION s
>>                                                   REFRESH PUBLICATION
>>    2691837 |    6100 | 16785 | tablesync worker |
>>    2691936 |    6100 | 16785 | tablesync worker |
>>   (3 rows)
>>
>> All these backends wait for 6100/16785, which is the subscription row
>> in pg_subscription. The tablesync workers are requesting
>> AccessShareLock; the client backend, however, asks for
>> AccessExclusiveLock.
>>
>> The entry is currently locked by:
>>
>>   test=# select a.pid, mode, backend_type from pg_locks l
>>            join pg_stat_activity a on (a.pid = l.pid)
>>           where classid=6100 and objid=16785 and granted;
>>
>>      pid   |      mode       |           backend_type
>>   ---------+-----------------+----------------------------------
>>    2690477 | AccessShareLock | logical replication apply worker
>>   (1 row)
>>
>> But the apply worker is not waiting for any locks, so what's going on?
>>
>> Well, the problem is that the apply worker is waiting for a
>> notification from the tablesync workers that the relation is synced,
>> which happens by updating the pg_subscription_rel row. And that wait
>> happens in wait_for_relation_state_change, which simply checks the
>> row in a loop, sleeping via WaitLatch().
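>>
>> (Easy to confirm: something like
>>
>>   select pid, wait_event_type, wait_event
>>     from pg_stat_activity where pid = 2690477;
>>
>> reports an IPC wait event for the apply worker, presumably
>> LogicalSyncStateChange, not a Lock wait. And the deadlock detector
>> only knows about heavyweight lock waits, so it never sees this.)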
>>
>> Unfortunately, the tablesync workers can't update the row, because the
>> client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
>> sneaked in first and is now waiting for an AccessExclusiveLock. So the
>> tablesync workers are stuck in the queue and can't proceed.
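>>
>> (The queueing behavior itself is nothing subscription-specific, by the
>> way. The same thing is easy to reproduce with a plain table t:
>>
>>   -- session 1
>>   BEGIN;
>>   SELECT * FROM t;    -- takes AccessShareLock, holds it until commit
>>
>>   -- session 2
>>   LOCK TABLE t IN ACCESS EXCLUSIVE MODE;    -- queued behind session 1
>>
>>   -- session 3
>>   SELECT * FROM t;    -- blocks too, queued behind session 2
>>
>> Session 3 only needs AccessShareLock, which is compatible with what
>> session 1 holds, but it still waits behind the pending
>> AccessExclusiveLock request.)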
>>
>> The client backend can't proceed, because it's waiting for a lock held
>> by the apply worker.
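>>
>> The lock half of the cycle shows up in pg_blocking_pids(), which also
>> accounts for conflicting requests queued ahead of a waiter:
>>
>>   select a.pid, a.backend_type, pg_blocking_pids(a.pid)
>>     from pg_stat_activity a
>>    where cardinality(pg_blocking_pids(a.pid)) > 0;
>>
>> This should report the client backend as blocked by the apply worker,
>> and the tablesync workers as blocked by the client backend. The apply
>> worker itself never shows up as waiting - its wait is a latch sleep,
>> not a lock - so there is no lock cycle for the deadlock detector to
>> break.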
>>
> 
> It seems there is some inconsistency in what you have written for the
> client backend/tablesync worker vs. the apply worker. The above text
> seems to be saying that the client backend and tablesync worker are
> waiting on the "subscription row in pg_subscription" while the apply
> worker is operating on "pg_subscription_rel". So, if that is true,
> they shouldn't get stuck.
> 
> I think the client backend and tablesync worker seem to be blocked
> here on a lock on pg_subscription_rel.
> 

Not really, they are all locking the subscription. All the locks are on
classid=6100, which is pg_subscription:

  test=# select 6100::regclass;
      regclass
  -----------------
   pg_subscription
  (1 row)

The thing is, the tablesync workers call UpdateSubscriptionRelState,
which takes a lock on the subscription object (in pg_subscription) at
the very beginning:

   LockSharedObject(SubscriptionRelationId, ...);

So that's the issue. I haven't explored why it's done this way, and
there's no comment explaining why locking the subscription is needed ...

>> The tablesync workers can't proceed because their lock request is stuck
>> behind the AccessExclusiveLock request.
>>
>> And the apply worker can't proceed, because it's waiting for a status
>> update from the tablesync workers.
>>
> 
> This part is not clear to me, because
> wait_for_relation_state_change()->GetSubscriptionRelState() seems to
> be releasing the lock while closing the relation. Am I missing
> something?
> 

I think you're missing the fact that GetSubscriptionRelState() acquires
and releases the lock on pg_subscription_rel, but that's not the lock
causing the issue. The problem is the lock on the pg_subscription row.
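
To make the distinction concrete: the stuck entries are object locks
(locktype "object", classid 6100) on the subscription itself, while
GetSubscriptionRelState() only takes a short-lived relation lock on the
pg_subscription_rel catalog. Something like this shows both kinds side
by side (16785 being the subscription OID from above):

  select locktype, classid, objid, relation::regclass, mode, granted
    from pg_locks
   where relation = 'pg_subscription_rel'::regclass
      or (classid = 6100 and objid = 16785);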

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


