Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From: Petr Jelinek
Subject: Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease
Msg-id: 73a45182-7c39-143a-9d10-2f73e28d2c8a@2ndquadrant.com
In reply to: Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On 24/06/17 04:50, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> Do you want to take a look at moving those elog calls around a bit?  That
>> should do it.
> 
> It would be a good idea to have some clarity on *why* that should do it.
> 
> Looking at the original report's log, but without having actually
> reproduced the problem, I guess what is happening is this:
> 
> 1. Subscription worker process (23117) gets a duplicate key conflict while
> trying to apply an update, and in consequence it exits.  (Is that supposed
> to happen?)
> 
> 2. Publication server process (23124) doesn't notice client connection
> loss right away.  By chance, the next thing it tries to send to the client
> is the debug output from LogicalIncreaseRestartDecodingForSlot.  Then it
> detects loss of connection (at 2017-06-21 14:55:12.033) and FATAL's out.
> But since the spinlock stuff has no tracking infrastructure, we don't
> know we are still holding the replication slot mutex.
> 
> 3. Process exit cleanup does know that it's supposed to release the
> replication slot, so it tries to take the mutex spinlock ... again.
> Eventually that times out and we get the "stuck spinlock" panic.
> 
> All correct so far?

Sounds about right.
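
To illustrate what the fix needs to guarantee (this is just a toy of mine,
not the server code; every name below is invented), the shape of steps 2
and 3 is roughly the following self-contained C program: a "log" call that
can bail out of the process while the spinlock is still held, and exit-time
cleanup that then tries to take the same spinlock and ends in the stuck
spinlock panic. Moving the log call after the release, which is what Peter
suggests upthread, takes the non-straight-line code out of the locked
section.

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

static atomic_flag slot_mutex = ATOMIC_FLAG_INIT;

static void
spin_acquire(atomic_flag *lock)
{
    long    spins = 0;

    while (atomic_flag_test_and_set(lock))
    {
        /* crude stand-in for the real stuck-spinlock timeout */
        if (++spins > 100000000L)
        {
            fprintf(stderr, "PANIC: stuck spinlock detected\n");
            abort();
        }
    }
}

static void
spin_release(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}

/*
 * Stand-in for elog(): trying to send the message to a dead client turns
 * into FATAL, i.e. a non-local exit while the caller still holds the lock.
 */
static void
toy_elog(const char *msg)
{
    fprintf(stderr, "DEBUG: %s\n", msg);
    fprintf(stderr, "FATAL: connection to client lost\n");
    exit(1);
}

/* Stand-in for the slot release done during process-exit cleanup. */
static void
slot_cleanup(void)
{
    spin_acquire(&slot_mutex);  /* never succeeds: we still hold it */
    spin_release(&slot_mutex);
}

int
main(void)
{
    atexit(slot_cleanup);

    spin_acquire(&slot_mutex);
    toy_elog("got new restart lsn");    /* the bug: logging inside the locked section */
    spin_release(&slot_mutex);          /* never reached */

    /* fixed shape: update shared state, spin_release(), and only then log */
    return 0;
}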

> So, okay, the proximate cause of the crash is a blatant violation of the
> rule that spinlocks may only be held across straight-line code segments.
> But I'm wondering about the client exit having occurred in the first
> place.  Why is that, and how would one ever recover?  It sure looks
> like this isn't the first subscription worker process that has tried
> and failed to apply the update.  If our attitude towards this situation is
> that it's okay to fork-bomb your server with worker processes continually
> respawning and making no progress, well, I don't think that's good enough.
> 

Well, we don't have conflict detection/handling in PG10 the way, for
example, pglogical does. Even once we have that, it probably won't be able
to resolve multiple unique index violations (there is no obvious way to do
that automatically). And we can't really make progress while there is an
unresolved constraint violation. To recover, one has to either remove the
conflicting row on the downstream or remove the transaction from
replication on the upstream by manually consuming it with
pg_logical_slot_get_binary_changes. That's arguably a somewhat ugly
interface for this; we might want to invent a nicer interface even for
PG10, but that would mean a catalog bump, so it would have to happen rather
soon if we go there.
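
To make the second option concrete, something along these lines works
against the upstream (hedged: the connection string, slot name 'sub_slot'
and publication name 'mypub' are placeholders of mine, not anything from
the report). pg_logical_slot_get_binary_changes() consumes everything
currently queued for the slot, so in practice you'd pass upto_lsn or
upto_nchanges to stop right after the offending transaction, and the slot
must not be in use by a walsender at that moment.

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /*
     * Consuming the changes advances the slot's confirmed position, so the
     * conflicting transaction is not sent to the subscriber again.  pgoutput
     * requires the proto_version and publication_names options.
     */
    res = PQexec(conn,
                 "SELECT pg_logical_slot_get_binary_changes("
                 "'sub_slot', NULL, NULL, "
                 "'proto_version', '1', 'publication_names', 'mypub')");

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}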

As for fork-bombing, it should be a very slow fork bomb (we rate-limit the
worker starting), but I agree it's not an ideal situation. I am open to
suggestions about what we can do there. If we had some kind of list of
non-recoverable errors, we could automatically disable the subscription on
them (although we'd need to be able to modify the catalog for that, which
may not be possible after an unrecoverable error), but it's not clear to me
how to reasonably produce such a list.
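
Just to spell out why it's slow (a deliberately simplified sketch of my own;
the five-second interval and the function names are invented, not the
launcher's actual code): a failed worker is only retried once a minimum
interval has passed since the last start attempt, so the respawning looks
like this rather than a tight loop.

#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define RESTART_INTERVAL_SECS 5     /* made-up throttle interval */

/* placeholder: pretend the worker always hits the conflict and exits */
static void
start_apply_worker(void)
{
    printf("worker started, hit duplicate key error, exited\n");
}

int
main(void)
{
    time_t      last_start = 0;
    int         attempts = 0;

    while (attempts < 3)
    {
        time_t      now = time(NULL);

        if (now - last_start >= RESTART_INTERVAL_SECS)
        {
            last_start = now;
            start_apply_worker();
            attempts++;
        }
        sleep(1);               /* the launcher wakes up periodically */
    }
    return 0;
}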

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


