Обсуждение: A proposal to force-drop replication slots to make disabling async/sync standbys or logical replication faster in production environments

Поиск
Список
Период
Сортировка
Hi,

Currently postgres doesn't allow dropping a replication slot that's active [1]. This can make certain operations more time-consuming or stuck in production environments. These operations are - disable async/sync standbys and disable logical replication that require the postgres running on standby or the subscriber to go down. If stopping postgres server takes time, the VM or container will have to be killed forcefully which can take a considerable amount of time as there are many layers in between.

How about we provide a function to force-drop a replication slot? All other things such as stopping postgres and gracefully unprovisioning VM etc. can be taken care of in the background. This force-drop function will also have to ensure that the walsender that's active for the replication slot is terminated gracefully without letting postmaster restart the other backends (right now if a wal sender is exited/terminated, the postmaster restarts all other backends too). The main advantage of the force-drop function is that the disable operations can be quicker and there is no down time/crash on the primary/source server.

Thoughts?

[1] ERROR:  replication slot "foo" is active for PID 2598155

Regards,
Bharath Rupireddy.
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
> How about we provide a function to force-drop a replication slot?

Isn't this akin to filing off the safety interlock on the loaded revolver
you keep in your hip pocket?  IMO the entire point of replication slots
is to not make it easy to lose data.

            regards, tom lane



On Thu, Jun 9, 2022 at 11:07 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Currently postgres doesn't allow dropping a replication slot that's active [1]. This can make certain operations more
time-consumingor stuck in production environments. These operations are - disable async/sync standbys and disable
logicalreplication that require the postgres running on standby or the subscriber to go down. If stopping postgres
servertakes time, the VM or container will have to be killed forcefully which can take a considerable amount of time as
thereare many layers in between. 
>

Why do you want to drop the slot when the server is going down? Is it
some temporary replication slot, otherwise, how will you resume
replication after restarting the server?

--
With Regards,
Amit Kapila.



On Thu, Jun 9, 2022 at 11:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
> > How about we provide a function to force-drop a replication slot?
>
> Isn't this akin to filing off the safety interlock on the loaded revolver
> you keep in your hip pocket?  IMO the entire point of replication slots
> is to not make it easy to lose data.

Agree. How about making the function superuser-only?

Regards,
Bharath Rupireddy.



On Thu, Jun 9, 2022 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 9, 2022 at 11:07 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Currently postgres doesn't allow dropping a replication slot that's active [1]. This can make certain operations
moretime-consuming or stuck in production environments. These operations are - disable async/sync standbys and disable
logicalreplication that require the postgres running on standby or the subscriber to go down. If stopping postgres
servertakes time, the VM or container will have to be killed forcefully which can take a considerable amount of time as
thereare many layers in between. 
> >
>
> Why do you want to drop the slot when the server is going down? Is it
> some temporary replication slot, otherwise, how will you resume
> replication after restarting the server?

The setup is this - primary, bunch of sync standbys, bunch of read
replicas (async standbys), bunch of logical replication subscribers -
now, the user wants to remove any of them for whatever reasons,
typical flow is to first stop the server, if stopping the server takes
time (for instance the standbys or subscribers lag behind the primary
by too much), kill the VM/host server to make the corresponding
replication slots inactive on the primary and then drop the
replication slots. The proposed  force-drop function helps speed up
these operations in production environments and it will also be
possible to provide an SLA for these disable operations.

I hope the user case is clear.

Regards,
Bharath Rupireddy.



Hi,

Why couldn't you terminate the active_pid associated with the slot you 
want to drop if it's active prior to dropping?


On 6/10/22 3:03 AM, Bharath Rupireddy wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you
canconfirm the sender and know the content is safe.
 
>
>
>
> On Thu, Jun 9, 2022 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Jun 9, 2022 at 11:07 AM Bharath Rupireddy
>> <bharath.rupireddyforpostgres@gmail.com> wrote:
>>> Currently postgres doesn't allow dropping a replication slot that's active [1]. This can make certain operations
moretime-consuming or stuck in production environments. These operations are - disable async/sync standbys and disable
logicalreplication that require the postgres running on standby or the subscriber to go down. If stopping postgres
servertakes time, the VM or container will have to be killed forcefully which can take a considerable amount of time as
thereare many layers in between.
 
>>>
>> Why do you want to drop the slot when the server is going down? Is it
>> some temporary replication slot, otherwise, how will you resume
>> replication after restarting the server?
> The setup is this - primary, bunch of sync standbys, bunch of read
> replicas (async standbys), bunch of logical replication subscribers -
> now, the user wants to remove any of them for whatever reasons,
> typical flow is to first stop the server, if stopping the server takes
> time (for instance the standbys or subscribers lag behind the primary
> by too much), kill the VM/host server to make the corresponding
> replication slots inactive on the primary and then drop the
> replication slots. The proposed  force-drop function helps speed up
> these operations in production environments and it will also be
> possible to provide an SLA for these disable operations.
>
> I hope the user case is clear.
>
> Regards,
> Bharath Rupireddy.
>
>



On Fri, Jun 10, 2022 at 8:42 PM Hsu, John <hsuchen@amazon.com> wrote:
>
> Hi,
>
> Why couldn't you terminate the active_pid associated with the slot you
> want to drop if it's active prior to dropping?

In that case, the slot becomes active immediately after killing the
old walsender because the standby/subscriber opens another connection
with the primary using the same replication slot. The replication slot
will be inactive for a moment during pg_terminate_backend and becomes
active again by the time we call pg_drop_replication_slot and we hit
the same ERROR:  replication slot "foo" is active for PID XXXXX.

The idea proposed here is to have a force-drop function that
terminates the walsender gracefully and drops the replication slot
even though there's somebody using it and all of this is done with an
exclusive lock on the slot so that nobody can acquire it while we are
dropping it.

Regards,
Bharath Rupireddy.