Thread: Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Bharath Rupireddy
Date:
On Thu, Dec 23, 2021 at 5:53 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
>
> Hi Hackers,
>
> I am considering implementing RPO (recovery point objective) enforcement feature for Postgres where the WAL writes on the primary are stalled when the WAL distance between the primary and standby exceeds the configured (replica_lag_in_bytes) threshold. This feature is useful particularly in the disaster recovery setups where primary and standby are in different regions and synchronous replication can't be set up for latency and performance reasons yet requires some level of RPO enforcement.

+1 for the idea in general. However, blocking writes on the primary
seems an extremely radical idea. The replicas can fall behind
transiently at times, and blocking writes on the primary may cause
applications to fail during these transient periods. This is not a
problem if the applications have retry logic for the writes. How about
blocking writes on the primary only if the replicas fall behind it for
a certain period of time?

> The idea here is to calculate the lag between the primary and the standby (Async?) server during XLogInsert and block the caller until the lag is less than the threshold value. We can calculate the max lag by iterating over ReplicationSlotCtl->replication_slots.

The "falling behind" can also be quantified by the number of
write-transactions on the primary. I think it's good to have the users
choose what the "falling behind" means for them. We can have something
like the "recovery_target" param with different options name, xid,
time, lsn.
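
As a rough sketch of how that choice could be exposed (a hedged illustration only: none of these GUC symbols exist in core, and only struct config_enum_entry is a real server type):

/* Hypothetical: let users pick what "falling behind" means, modeled
 * on the recovery_target family of settings. */
typedef enum
{
    REPLICA_LAG_LSN,            /* byte distance between LSNs */
    REPLICA_LAG_TIME,           /* wall-clock replay delay */
    REPLICA_LAG_XID             /* write-transaction distance */
} ReplicaLagMetric;

static const struct config_enum_entry replica_lag_metric_options[] = {
    {"lsn", REPLICA_LAG_LSN, false},
    {"time", REPLICA_LAG_TIME, false},
    {"xid", REPLICA_LAG_XID, false},
    {NULL, 0, false}
};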

> If this is not something we want to do in core, at least adding a hook for XLogInsert is of great value.

IMHO, this feature may not be needed by everyone, so the hook-way
seems reasonable: postgres vendors can provide different
implementations (for instance, they can write an extension
implementing this hook which blocks writes on the primary, writes some
log messages, informs some service layer that the replicas are falling
behind the primary, etc.). Since XLogInsert is a hot path that gets
called very frequently, the hook really should do as little work as
possible, otherwise write latency may increase.
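
To make the "as little work as possible" point concrete, here is a minimal hedged sketch: the hook type and the shared state are invented (no such hook exists in core), while PGDLLIMPORT and pg_atomic_read_u32 are real:

/* Core side: hypothetical hook declaration. */
typedef void (*xlog_insert_hook_type) (void);
extern PGDLLIMPORT xlog_insert_hook_type xlog_insert_hook;

/* Extension side: hypothetical shared state, maintained elsewhere,
 * e.g. by a background worker that watches the replicas. */
typedef struct XLogThrottleShared
{
    pg_atomic_uint32 lag_exceeded;
} XLogThrottleShared;

static XLogThrottleShared *throttle_state;

/* Keep the hot-path cost to a single shared-memory read. */
static void
my_xlog_insert_hook(void)
{
    if (pg_atomic_read_u32(&throttle_state->lag_exceeded) == 0)
        return;                 /* common case: nothing to do */

    /* ...block, log, or notify a service layer here... */
}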

> A few other scenarios I can think of with the hook are:
>
> Enforcing RPO as described above
> Enforcing rate limit and slow throttling when sync standby is falling behind (could be flush lag or replay lag)
> Transactional log rate governance - useful for cloud providers to provide SKU sizes based on allowed WAL writes.
>
> Thoughts?

The hook can help to achieve the above objectives but where to place
it and what parameters it should take as input (or what info it should
emit out of the server via the hook) are important too.

Having said all that, the RPO feature can also be implemented outside
of postgres. A simple implementation could be: get the primary's
current WAL LSN using pg_current_wal_lsn() and each replica's
restart_lsn from pg_replication_slots; if they differ by a certain
amount, issue the ALTER SYSTEM SET READ ONLY command [1] on the
primary. This requires connections to the server and proper access
rights. The feature can also be implemented as an extension (without
the hook) which doesn't require any connections to the server yet can
access the required info (the primary's current WAL LSN, the
restart_lsn of the replication slots, etc.), but the RPO enforcement
may not be immediate as the server doesn't have any hooks in
XLogInsert or some other area.

[1] - https://www.postgresql.org/message-id/CAAJ_b967uKBiW6gbHr5aPzweURYjEGv333FHVHxvJmMhanwHXA%40mail.gmail.com
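
For illustration, a minimal hedged sketch of that external approach using libpq; the DSN and the 1GB threshold are placeholders, error handling is elided, and ALTER SYSTEM SET READ ONLY is the proposed command from [1], not committed PostgreSQL:

#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    /* placeholder DSN; a real tool would take this as configuration */
    PGconn     *conn = PQconnectdb("host=primary dbname=postgres");
    PGresult   *res;

    /* worst-case replica lag in bytes, via the slots' restart_lsn */
    res = PQexec(conn,
                 "SELECT max(pg_current_wal_lsn() - restart_lsn) "
                 "FROM pg_replication_slots");

    if (PQresultStatus(res) == PGRES_TUPLES_OK &&
        atoll(PQgetvalue(res, 0, 0)) > 1073741824LL)    /* arbitrary 1GB */
    {
        PQclear(res);
        /* proposed command from [1]; not part of committed PostgreSQL */
        res = PQexec(conn, "ALTER SYSTEM SET READ ONLY");
    }

    PQclear(res);
    PQfinish(conn);
    return 0;
}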

Regards,
Bharath Rupireddy.



Fwd: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: SATYANARAYANA NARLAPURAM
Date:
Please find the attached draft patch.

On Thu, Dec 23, 2021 at 2:47 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Thu, Dec 23, 2021 at 5:53 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
>
> Hi Hackers,
>
> I am considering implementing RPO (recovery point objective) enforcement feature for Postgres where the WAL writes on the primary are stalled when the WAL distance between the primary and standby exceeds the configured (replica_lag_in_bytes) threshold. This feature is useful particularly in the disaster recovery setups where primary and standby are in different regions and synchronous replication can't be set up for latency and performance reasons yet requires some level of RPO enforcement.

+1 for the idea in general. However, blocking writes on the primary
seems an extremely radical idea. The replicas can fall behind
transiently at times, and blocking writes on the primary may cause
applications to fail during these transient periods. This is not a
problem if the applications have retry logic for the writes. How about
blocking writes on the primary only if the replicas fall behind it for
a certain period of time?

My proposal is to block the caller from writing until the lag situation improves. I don't want to throw any errors and fail the transaction. I think we are aligned?
 

> The idea here is to calculate the lag between the primary and the standby (Async?) server during XLogInsert and block the caller until the lag is less than the threshold value. We can calculate the max lag by iterating over ReplicationSlotCtl->replication_slots.

The "falling behind" can also be quantified by the number of
write-transactions on the primary. I think it's good to have the users
choose what the "falling behind" means for them. We can have something
like the "recovery_target" param with different options name, xid,
time, lsn.

Transactions can be of arbitrary size and length, so these options may not provide the desired results. Time is a worthy option to add.
 

> If this is not something we want to do in core, at least adding a hook for XLogInsert is of great value.

IMHO, this feature may not be needed by everyone, so the hook-way
seems reasonable: postgres vendors can provide different
implementations (for instance, they can write an extension
implementing this hook which blocks writes on the primary, writes some
log messages, informs some service layer that the replicas are falling
behind the primary, etc.). Since XLogInsert is a hot path that gets
called very frequently, the hook really should do as little work as
possible, otherwise write latency may increase.

A hook is a good start. If there is enough interest, an extension can be added as a contrib module.


> A few other scenarios I can think of with the hook are:
>
> Enforcing RPO as described above
> Enforcing rate limit and slow throttling when sync standby is falling behind (could be flush lag or replay lag)
> Transactional log rate governance - useful for cloud providers to provide SKU sizes based on allowed WAL writes.
>
> Thoughts?

The hook can help to achieve the above objectives but where to place
it and what parameters it should take as input (or what info it should
emit out of the server via the hook) are important too.

XLogInsert, in my opinion, is the best place to call it, and the hook can be something like "void xlog_insert_hook(void)", as all the throttling logic requires is the current flush position, which can be obtained from GetFlushRecPtr and ReplicationSlotCtl. Attached is a draft patch.
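
For reference, the lag computation such a patch needs could look roughly like this; the function itself is invented, while ReplicationSlotCtl, max_replication_slots, and the slot fields are real internals:

/* Sketch: smallest restart_lsn across in-use slots; its distance from
 * GetFlushRecPtr() approximates the worst replica lag. */
static XLogRecPtr
GetMinReplicaRestartLSN(void)
{
    XLogRecPtr  min_restart = InvalidXLogRecPtr;

    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        XLogRecPtr  restart;

        if (!s->in_use)
            continue;

        SpinLockAcquire(&s->mutex);
        restart = s->data.restart_lsn;
        SpinLockRelease(&s->mutex);

        if (XLogRecPtrIsInvalid(min_restart) || restart < min_restart)
            min_restart = restart;
    }
    LWLockRelease(ReplicationSlotControlLock);

    return min_restart;
}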
 

Having said all that, the RPO feature can also be implemented outside
of postgres. A simple implementation could be: get the primary's
current WAL LSN using pg_current_wal_lsn() and each replica's
restart_lsn from pg_replication_slots; if they differ by a certain
amount, issue the ALTER SYSTEM SET READ ONLY command [1] on the
primary. This requires connections to the server and proper access
rights. The feature can also be implemented as an extension (without
the hook) which doesn't require any connections to the server yet can
access the required info (the primary's current WAL LSN, the
restart_lsn of the replication slots, etc.), but the RPO enforcement
may not be immediate as the server doesn't have any hooks in
XLogInsert or some other area.
 
READ ONLY is a decent choice, but it can fail the writes, or not take effect until the end of the transaction?
 
[1] - https://www.postgresql.org/message-id/CAAJ_b967uKBiW6gbHr5aPzweURYjEGv333FHVHxvJmMhanwHXA%40mail.gmail.com

Regards,
Bharath Rupireddy.
Attachments

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Dilip Kumar
Date:
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:


XLogInsert, in my opinion, is the best place to call it, and the hook can be something like "void xlog_insert_hook(void)", as all the throttling logic requires is the current flush position, which can be obtained from GetFlushRecPtr and ReplicationSlotCtl. Attached is a draft patch.

IMHO, it is not a good idea to call an external hook function inside a critical section.  Generally, we ensure that we do not call any code path within a critical section which can throw an error and if we start calling the external hook then we lose that control.  It should be blocked at the operation level itself e.g. ALTER TABLE READ ONLY, or by some other hook at a little higher level.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Bharath Rupireddy
Date:
On Fri, Dec 24, 2021 at 4:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
>>
>> XLogInsert, in my opinion, is the best place to call it, and the hook can be something like "void xlog_insert_hook(void)", as all the throttling logic requires is the current flush position, which can be obtained from GetFlushRecPtr and ReplicationSlotCtl. Attached is a draft patch.
>
> IMHO, it is not a good idea to call an external hook function inside a critical section.  Generally, we ensure that we do not call any code path within a critical section which can throw an error and if we start calling the external hook then we lose that control.  It should be blocked at the operation level itself e.g. ALTER TABLE READ ONLY, or by some other hook at a little higher level.

Yeah, good point. It's not advisable to give control to the external
module in the critical section. For instance, memory allocation isn't
allowed (see [1]) and ereport(ERROR, ...) would transform to PANIC
inside the critical section (see [2], [3]). Moreover, a critical
section is meant to be short-spanned, i.e. executing as little code as
possible. There's no guarantee that an external module would follow
these rules.

I suggest we do it at the level of transaction start, i.e. when a
txnid is getting allocated, i.e. in AssignTransactionId(). If we do
this, when the limit for the throttling is exceeded, the current txn
(even if it is a long-running txn) continues to do the WAL insertions,
but the next txns would get blocked. This is okay and can be conveyed
to the users via documentation if need be. We already block txnid
assignments for parallel workers in this function, so this is a good
choice IMO.
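
Roughly, the gate could look like this; the helper and both names in the loop condition are invented, while CHECK_FOR_INTERRUPTS() and pg_usleep() are real:

/* Hypothetical: called from AssignTransactionId() before handing out
 * a new xid, so read-only sessions are never blocked. */
static void
WaitForReplicaCatchup(void)
{
    while (GetMaxReplicaLagBytes() > replica_lag_in_bytes)
    {
        CHECK_FOR_INTERRUPTS();     /* stay cancellable while waiting */
        pg_usleep(10000L);          /* recheck every 10ms */
    }
}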

Thoughts?

[1]
/*
 * You should not do memory allocations within a critical section, because
 * an out-of-memory error will be escalated to a PANIC. To enforce that
 * rule, the allocation functions Assert that.
 */
#define AssertNotInCriticalSection(context) \
    Assert(CritSectionCount == 0 || (context)->allowInCritSection)

[2]
        /*
         * If we are inside a critical section, all errors become PANIC
         * errors.  See miscadmin.h.
         */
        if (CritSectionCount > 0)
            elevel = PANIC;

[3]
 * A related, but conceptually distinct, mechanism is the "critical section"
 * mechanism.  A critical section not only holds off cancel/die interrupts,
 * but causes any ereport(ERROR) or ereport(FATAL) to become ereport(PANIC)
 * --- that is, a system-wide reset is forced.  Needless to say, only really
 * *critical* code should be marked as a critical section!  Currently, this
 * mechanism is only used for XLOG-related code.

Regards,
Bharath Rupireddy.



Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: SATYANARAYANA NARLAPURAM
Date:


On Fri, Dec 24, 2021 at 3:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:


XLogInsert, in my opinion, is the best place to call it, and the hook can be something like "void xlog_insert_hook(void)", as all the throttling logic requires is the current flush position, which can be obtained from GetFlushRecPtr and ReplicationSlotCtl. Attached is a draft patch.

IMHO, it is not a good idea to call an external hook function inside a critical section.  Generally, we ensure that we do not call any code path within a critical section which can throw an error and if we start calling the external hook then we lose that control. 

Thank you for the comment. XLogInsertRecord is inside a critical section but not XLogInsert. Am I missing something?
 
It should be blocked at the operation level itself e.g. ALTER TABLE READ ONLY, or by some other hook at a little higher level.
 
There is a lot of maintenance overhead with a custom implementation at the individual database and table level. This doesn't provide the necessary control that I am looking for.


 

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Dilip Kumar
Date:
On Sun, Dec 26, 2021 at 3:52 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:


On Fri, Dec 24, 2021 at 3:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:


XLogInsert, in my opinion, is the best place to call it, and the hook can be something like "void xlog_insert_hook(void)", as all the throttling logic requires is the current flush position, which can be obtained from GetFlushRecPtr and ReplicationSlotCtl. Attached is a draft patch.

IMHO, it is not a good idea to call an external hook function inside a critical section.  Generally, we ensure that we do not call any code path within a critical section which can throw an error and if we start calling the external hook then we lose that control. 

Thank you for the comment. XLogInsertRecord is inside a critical section but not XLogInsert. Am I missing something?

Actually all the WAL insertions are done under a critical section (except few exceptions), that means if you see all the references of XLogInsert(), it is always called under the critical section and that is my main worry about hooking at XLogInsert level.
 
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: SATYANARAYANA NARLAPURAM
Date:


On Sat, Dec 25, 2021 at 6:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sun, Dec 26, 2021 at 3:52 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:


On Fri, Dec 24, 2021 at 3:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:


XLogInsert, in my opinion, is the best place to call it, and the hook can be something like "void xlog_insert_hook(void)", as all the throttling logic requires is the current flush position, which can be obtained from GetFlushRecPtr and ReplicationSlotCtl. Attached is a draft patch.

IMHO, it is not a good idea to call an external hook function inside a critical section.  Generally, we ensure that we do not call any code path within a critical section which can throw an error and if we start calling the external hook then we lose that control. 

Thank you for the comment. XLogInsertRecord is inside a critical section but not XLogInsert. Am I missing something?

Actually all the WAL insertions are done under a critical section (except few exceptions), that means if you see all the references of XLogInsert(), it is always called under the critical section and that is my main worry about hooking at XLogInsert level.

Got it, understood the concern. But can we document the limitations of the hook and let the hook take care of it? I don't expect an error to be thrown here since we are not planning to allocate memory or make file system calls but instead look at the shared memory state and add delays when required.

 
 
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Julien Rouhaud
Date:
On Sun, Dec 26, 2021 at 1:06 PM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
>
> Got it, understood the concern. But can we document the limitations of the hook and let the hook take care of it? I don't expect an error to be thrown here since we are not planning to allocate memory or make file system calls but instead look at the shared memory state and add delays when required.

It wouldn't work.  You can't make any assumption about how long it
would take for the replication lag to resolve, so you may have to wait
for a very long time.  It means that at the very least the sleep has
to be interruptible and therefore can raise an error.  In general
there isn't much you can do in a critical section, so this approach
doesn't seem sensible to me.



Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Dilip Kumar
Date:
On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:

Actually all the WAL insertions are done under a critical section (except few exceptions), that means if you see all the references of XLogInsert(), it is always called under the critical section and that is my main worry about hooking at XLogInsert level.

Got it, understood the concern. But can we document the limitations of the hook and let the hook take care of it? I don't expect an error to be thrown here since we are not planning to allocate memory or make file system calls but instead look at the shared memory state and add delays when required.


Yet another problem is that if we are in XlogInsert() that means we are holding the buffer locks on all the pages we have modified, so if we add a hook at that level which can make it wait then we would also block any of the read operations needed to read from those buffers.  I haven't thought what could be better way to do this but this is certainly not good.

 
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: SATYANARAYANA NARLAPURAM
Date:


On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:

Actually all the WAL insertions are done under a critical section (except few exceptions), that means if you see all the references of XLogInsert(), it is always called under the critical section and that is my main worry about hooking at XLogInsert level.

Got it, understood the concern. But can we document the limitations of the hook and let the hook take care of it? I don't expect an error to be thrown here since we are not planning to allocate memory or make file system calls but instead look at the shared memory state and add delays when required.


Yet another problem is that if we are in XlogInsert() that means we are holding the buffer locks on all the pages we have modified, so if we add a hook at that level which can make it wait then we would also block any of the read operations needed to read from those buffers.  I haven't thought what could be better way to do this but this is certainly not good.

Yes, this is a problem. The other approach is adding a hook at XLogWrite/XLogFlush? All the other backends will be waiting behind the WALWriteLock. The process that is performing the write enters into a busy loop with small delays until the criteria are met. Inability to process the interrupts inside the critical section is a challenge in both approaches. Any other thoughts?
 

 
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Stephen Frost
Date:
Greetings,

* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
> On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
> > satyanarlapuram@gmail.com> wrote:
> >>> Actually all the WAL insertions are done under a critical section
> >>> (except few exceptions), that means if you see all the references of
> >>> XLogInsert(), it is always called under the critical section and that is my
> >>> main worry about hooking at XLogInsert level.
> >>>
> >>
> >> Got it, understood the concern. But can we document the limitations of
> >> the hook and let the hook take care of it? I don't expect an error to be
> >> thrown here since we are not planning to allocate memory or make file
> >> system calls but instead look at the shared memory state and add delays
> >> when required.
> >>
> >>
> > Yet another problem is that if we are in XlogInsert() that means we are
> > holding the buffer locks on all the pages we have modified, so if we add a
> > hook at that level which can make it wait then we would also block any of
> > the read operations needed to read from those buffers.  I haven't thought
> > what could be better way to do this but this is certainly not good.
> >
>
> Yes, this is a problem. The other approach is adding a hook at
> XLogWrite/XLogFlush? All the other backends will be waiting behind the
> WALWriteLock. The process that is performing the write enters into a busy
> loop with small delays until the criteria are met. Inability to process the
> interrupts inside the critical section is a challenge in both approaches.
> Any other thoughts?

Why not have this work the exact same way sync replicas do, except that
it's based off of some byte/time lag for some set of async replicas?
That is, in RecordTransactionCommit(), perhaps right after the
SyncRepWaitForLSN() call, or maybe even add this to that function?  Sure
seems like there's a lot of similarity.
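
In code terms, that lands roughly here in RecordTransactionCommit() (xact.c); SyncRepWaitForLSN() and XactLastRecEnd are real, AsyncReplicaLagWait() is an invented name for the proposed wait:

    /* existing syncrep wait at commit */
    SyncRepWaitForLSN(XactLastRecEnd, true);

    /* hypothetical: additionally wait until lagging async replicas are
     * within the configured byte/time lag, with the same semantics */
    AsyncReplicaLagWait(XactLastRecEnd);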

Thanks,

Stephen

Attachments

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: SATYANARAYANA NARLAPURAM
Date:
Stephen, thank you!

On Wed, Dec 29, 2021 at 5:46 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
> On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
> > satyanarlapuram@gmail.com> wrote:
> >>> Actually all the WAL insertions are done under a critical section
> >>> (except few exceptions), that means if you see all the references of
> >>> XLogInsert(), it is always called under the critical section and that is my
> >>> main worry about hooking at XLogInsert level.
> >>>
> >>
> >> Got it, understood the concern. But can we document the limitations of
> >> the hook and let the hook take care of it? I don't expect an error to be
> >> thrown here since we are not planning to allocate memory or make file
> >> system calls but instead look at the shared memory state and add delays
> >> when required.
> >>
> >>
> > Yet another problem is that if we are in XlogInsert() that means we are
> > holding the buffer locks on all the pages we have modified, so if we add a
> > hook at that level which can make it wait then we would also block any of
> > the read operations needed to read from those buffers.  I haven't thought
> > what could be better way to do this but this is certainly not good.
> >
>
> Yes, this is a problem. The other approach is adding a hook at
> XLogWrite/XLogFlush? All the other backends will be waiting behind the
> WALWriteLock. The process that is performing the write enters into a busy
> loop with small delays until the criteria are met. Inability to process the
> interrupts inside the critical section is a challenge in both approaches.
> Any other thoughts?

Why not have this work the exact same way sync replicas do, except that
it's based off of some byte/time lag for some set of async replicas?
That is, in RecordTransactionCommit(), perhaps right after the
SyncRepWaitForLSN() call, or maybe even add this to that function?  Sure
seems like there's a lot of similarity.

I was thinking of achieving log governance (throttling WAL MB/sec) and also providing RPO guarantees. In this model, it is hard to throttle WAL generation of a long running transaction (for example copy/select into). However, this meets my RPO needs. Are you in support of adding a hook or the actual change? IMHO, the hook allows more creative options. I can go ahead and make a patch accordingly.


 
Thanks,

Stephen

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Stephen Frost
Date:
Greetings,

On Wed, Dec 29, 2021 at 14:04 SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
Stephen, thank you!

On Wed, Dec 29, 2021 at 5:46 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
> On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
> > satyanarlapuram@gmail.com> wrote:
> >>> Actually all the WAL insertions are done under a critical section
> >>> (except few exceptions), that means if you see all the references of
> >>> XLogInsert(), it is always called under the critical section and that is my
> >>> main worry about hooking at XLogInsert level.
> >>>
> >>
> >> Got it, understood the concern. But can we document the limitations of
> >> the hook and let the hook take care of it? I don't expect an error to be
> >> thrown here since we are not planning to allocate memory or make file
> >> system calls but instead look at the shared memory state and add delays
> >> when required.
> >>
> >>
> > Yet another problem is that if we are in XlogInsert() that means we are
> > holding the buffer locks on all the pages we have modified, so if we add a
> > hook at that level which can make it wait then we would also block any of
> > the read operations needed to read from those buffers.  I haven't thought
> > what could be better way to do this but this is certainly not good.
> >
>
> Yes, this is a problem. The other approach is adding a hook at
> XLogWrite/XLogFlush? All the other backends will be waiting behind the
> WALWriteLock. The process that is performing the write enters into a busy
> loop with small delays until the criteria are met. Inability to process the
> interrupts inside the critical section is a challenge in both approaches.
> Any other thoughts?

Why not have this work the exact same way sync replicas do, except that
it's based off of some byte/time lag for some set of async replicas?
That is, in RecordTransactionCommit(), perhaps right after the
SyncRepWaitForLSN() call, or maybe even add this to that function?  Sure
seems like there's a lot of similarity.

I was thinking of achieving log governance (throttling WAL MB/sec) and also providing RPO guarantees. In this model, it is hard to throttle WAL generation of a long running transaction (for example copy/select into).

Long running transactions have a lot of downsides and are best discouraged. I don’t know that we should be designing this for that case specifically, particularly given the complications it would introduce as discussed on this thread already.

However, this meets my RPO needs. Are you in support of adding a hook or the actual change? IMHO, the hook allows more creative options. I can go ahead and make a patch accordingly.

I would think this would make more sense as part of core rather than a hook, as that then requires an extension and additional setup to get going, which raises the bar quite a bit when it comes to actually being used.

Thanks,

Stephen

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: SATYANARAYANA NARLAPURAM
Date:


On Wed, Dec 29, 2021 at 11:16 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

On Wed, Dec 29, 2021 at 14:04 SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
Stephen, thank you!

On Wed, Dec 29, 2021 at 5:46 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
> On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
> > satyanarlapuram@gmail.com> wrote:
> >>> Actually all the WAL insertions are done under a critical section
> >>> (except few exceptions), that means if you see all the references of
> >>> XLogInsert(), it is always called under the critical section and that is my
> >>> main worry about hooking at XLogInsert level.
> >>>
> >>
> >> Got it, understood the concern. But can we document the limitations of
> >> the hook and let the hook take care of it? I don't expect an error to be
> >> thrown here since we are not planning to allocate memory or make file
> >> system calls but instead look at the shared memory state and add delays
> >> when required.
> >>
> >>
> > Yet another problem is that if we are in XlogInsert() that means we are
> > holding the buffer locks on all the pages we have modified, so if we add a
> > hook at that level which can make it wait then we would also block any of
> > the read operations needed to read from those buffers.  I haven't thought
> > what could be better way to do this but this is certainly not good.
> >
>
> Yes, this is a problem. The other approach is adding a hook at
> XLogWrite/XLogFlush? All the other backends will be waiting behind the
> WALWriteLock. The process that is performing the write enters into a busy
> loop with small delays until the criteria are met. Inability to process the
> interrupts inside the critical section is a challenge in both approaches.
> Any other thoughts?

Why not have this work the exact same way sync replicas do, except that
it's based off of some byte/time lag for some set of async replicas?
That is, in RecordTransactionCommit(), perhaps right after the
SyncRepWaitForLSN() call, or maybe even add this to that function?  Sure
seems like there's a lot of similarity.

I was thinking of achieving log governance (throttling WAL MB/sec) and also providing RPO guarantees. In this model, it is hard to throttle WAL generation of a long running transaction (for example copy/select into).

Long running transactions have a lot of downsides and are best discouraged. I don’t know that we should be designing this for that case specifically, particularly given the complications it would introduce as discussed on this thread already.

However, this meets my RPO needs. Are you in support of adding a hook or the actual change? IMHO, the hook allows more creative options. I can go ahead and make a patch accordingly.

I would think this would make more sense as part of core rather than a hook, as that then requires an extension and additional setup to get going, which raises the bar quite a bit when it comes to actually being used.

Sounds good, I will work on making the changes accordingly.

Thanks,

Stephen

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Andres Freund
Date:
Hi,

On 2021-12-27 16:40:28 -0800, SATYANARAYANA NARLAPURAM wrote:
> > Yet another problem is that if we are in XlogInsert() that means we are
> > holding the buffer locks on all the pages we have modified, so if we add a
> > hook at that level which can make it wait then we would also block any of
> > the read operations needed to read from those buffers.  I haven't thought
> > what could be better way to do this but this is certainly not good.
> >
> 
> Yes, this is a problem. The other approach is adding a hook at
> XLogWrite/XLogFlush?

That's pretty much the same - XLogInsert() can trigger an
XLogWrite()/Flush().

I think it's a complete no-go to add throttling to these places. It's quite
possible that it'd cause new deadlocks, and it's almost guaranteed to have
unintended consequences (e.g. replication falling back further because
XLogFlush() is being throttled).

I also don't think it's a sane thing to add hooks to these places. It's
complicated enough as-is, adding the chance for random other things to happen
during such crucial operations will make it even harder to maintain.

Greetings,

Andres Freund



Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: SATYANARAYANA NARLAPURAM
Date:


On Wed, Dec 29, 2021 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2021-12-27 16:40:28 -0800, SATYANARAYANA NARLAPURAM wrote:
> > Yet another problem is that if we are in XlogInsert() that means we are
> > holding the buffer locks on all the pages we have modified, so if we add a
> > hook at that level which can make it wait then we would also block any of
> > the read operations needed to read from those buffers.  I haven't thought
> > what could be better way to do this but this is certainly not good.
> >
>
> Yes, this is a problem. The other approach is adding a hook at
> XLogWrite/XLogFlush?

That's pretty much the same - XLogInsert() can trigger an
XLogWrite()/Flush().

I think it's a complete no-go to add throttling to these places. It's quite
possible that it'd cause new deadlocks, and it's almost guaranteed to have
unintended consequences (e.g. replication falling back further because
XLogFlush() is being throttled).

I also don't think it's a sane thing to add hooks to these places. It's
complicated enough as-is, adding the chance for random other things to happen
during such crucial operations will make it even harder to maintain.

Andres, thanks for the comments. Agreed on this based on the previous discussions on this thread. Could you please share your thoughts on adding it after SyncRepWaitForLSN()?
 

Greetings,

Andres Freund

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Andres Freund
Date:
Hi,

On 2021-12-29 11:34:53 -0800, SATYANARAYANA NARLAPURAM wrote:
> On Wed, Dec 29, 2021 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
> Andres, thanks for the comments. Agreed on this based on the previous
> discussions on this thread. Could you please share your thoughts on adding
> it after SyncRepWaitForLSN()?

I don't think that's good either - you're delaying transaction commit
(i.e. xact becoming visible / locks being released). That also has the danger
of increasing lock contention (albeit more likely to be heavyweight locks /
serializable state). It'd have to be after the transaction actually committed.

Greetings,

Andres Freund



Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Dilip Kumar
Date:
On Thu, Dec 30, 2021 at 1:09 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2021-12-29 11:34:53 -0800, SATYANARAYANA NARLAPURAM wrote:
> On Wed, Dec 29, 2021 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
> Andres, thanks for the comments. Agreed on this based on the previous
> discussions on this thread. Could you please share your thoughts on adding
> it after SyncRepWaitForLSN()?

I don't think that's good either - you're delaying transaction commit
(i.e. xact becoming visible / locks being released).

Agree with that.
 
That also has the danger
of increasing lock contention (albeit more likely to be heavyweight locks /
serializable state). It'd have to be after the transaction actually committed.

Yeah, I think that would make sense, even though we will be allowing a new backend to get connected, insert WAL, and get committed; only after that will it be throttled.  However, if the number of max connections is very high, then even after we detect a lag a significant amount of WAL could be generated, even keeping long-running transactions aside.  But I think it will still serve the purpose of what Satya is trying to achieve.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: SATYANARAYANA NARLAPURAM
Date:


On Wed, Dec 29, 2021 at 10:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Dec 30, 2021 at 1:09 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2021-12-29 11:34:53 -0800, SATYANARAYANA NARLAPURAM wrote:
> On Wed, Dec 29, 2021 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
> Andres, thanks for the comments. Agreed on this based on the previous
> discussions on this thread. Could you please share your thoughts on adding
> it after SyncRepWaitForLSN()?

I don't think that's good either - you're delaying transaction commit
(i.e. xact becoming visible / locks being released).

Agree with that.
 
That also has the danger
of increasing lock contention (albeit more likely to be heavyweight locks /
serializable state). It'd have to be after the transaction actually committed.

Yeah, I think that would make sense, even though we will be allowing a new backend to get connected, insert WAL, and get committed; only after that will it be throttled.  However, if the number of max connections is very high, then even after we detect a lag a significant amount of WAL could be generated, even keeping long-running transactions aside.  But I think it will still serve the purpose of what Satya is trying to achieve.

I am afraid there are problems with making the RPO check post releasing the locks. By this time the transaction is committed and visible to the other backends (ProcArrayEndTransaction is already called) though the intention is to block committing transactions that violate the defined RPO. Even though we block existing connections starting a new transaction, it is possible to do writes by opening a new connection / canceling the query. I am not too much worried about the lock contention as the system is already hosed because of the policy. This behavior is very similar to what happens when the Sync standby is not responding. Thoughts?


 

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Dilip Kumar
Date:
On Thu, Dec 30, 2021 at 12:36 PM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:

Yeah, I think that would make sense, even though we will be allowing a new backend to get connected, insert WAL, and get committed; only after that will it be throttled.  However, if the number of max connections is very high, then even after we detect a lag a significant amount of WAL could be generated, even keeping long-running transactions aside.  But I think it will still serve the purpose of what Satya is trying to achieve.

I am afraid there are problems with making the RPO check post releasing the locks. By this time the transaction is committed and visible to the other backends (ProcArrayEndTransaction is already called) though the intention is to block committing transactions that violate the defined RPO. Even though we block existing connections starting a new transaction, it is possible to do writes by opening a new connection / canceling the query. I am not too much worried about the lock contention as the system is already hosed because of the policy. This behavior is very similar to what happens when the Sync standby is not responding. Thoughts?

Yeah, that's true, but even if we are blocking the transactions from committing, it is still possible for a new connection to come in and generate more WAL. I agree with the other part, though: if you throttle after committing, then the user can cancel queries and generate more WAL from those sessions as well. But that is an extreme case where application developers want to bypass the throttling and generate more WAL.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Bharath Rupireddy
Date:
On Thu, Dec 30, 2021 at 1:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Dec 30, 2021 at 12:36 PM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
>>>
>>>
>>> Yeah, I think that would make sense, even though we will be allowing a new backend to get connected, insert WAL, and get committed; only after that will it be throttled.  However, if the number of max connections is very high, then even after we detect a lag a significant amount of WAL could be generated, even keeping long-running transactions aside.  But I think it will still serve the purpose of what Satya is trying to achieve.
>>
>>
>> I am afraid there are problems with making the RPO check post releasing the locks. By this time the transaction is committed and visible to the other backends (ProcArrayEndTransaction is already called) though the intention is to block committing transactions that violate the defined RPO. Even though we block existing connections starting a new transaction, it is possible to do writes by opening a new connection / canceling the query. I am not too much worried about the lock contention as the system is already hosed because of the policy. This behavior is very similar to what happens when the Sync standby is not responding. Thoughts?
>
>
> Yeah, that's true, but even if we are blocking the transactions from committing, it is still possible for a new connection to come in and generate more WAL. I agree with the other part, though: if you throttle after committing, then the user can cancel queries and generate more WAL from those sessions as well. But that is an extreme case where application developers want to bypass the throttling and generate more WAL.

How about having the new hook at the start of the new txn?  If we do
this, when the limit for the throttling is exceeded, the current txn
(even if it is a long running one) continues to do the WAL insertions,
the next txns would get blocked. Thoughts?

Regards,
Bharath Rupireddy.



Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Dilip Kumar
Date:
On Thu, Dec 30, 2021 at 1:41 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

>
> Yeah, that's true, but even if we are blocking the transactions from committing, it is still possible for a new connection to come in and generate more WAL. I agree with the other part, though: if you throttle after committing, then the user can cancel queries and generate more WAL from those sessions as well. But that is an extreme case where application developers want to bypass the throttling and generate more WAL.

How about having the new hook at the start of the new txn?  If we do
this, when the limit for the throttling is exceeded, the current txn
(even if it is a long running one) continues to do the WAL insertions,
the next txns would get blocked. Thoughts?

Do you mean while StartTransactionCommand or while assigning a new transaction id?  If it is at StartTransactionCommand then we would be blocking the sessions which are only performing read queries right?  If we are doing at the transaction assignment level then we might be holding some of the locks so this might not be any better than throttling inside the commit.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: SATYANARAYANA NARLAPURAM
Date:


On Thu, Dec 30, 2021 at 12:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Dec 30, 2021 at 1:41 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

>
> Yeah, that's true, but even if we are blocking the transactions from committing, it is still possible for a new connection to come in and generate more WAL. I agree with the other part, though: if you throttle after committing, then the user can cancel queries and generate more WAL from those sessions as well. But that is an extreme case where application developers want to bypass the throttling and generate more WAL.

How about having the new hook at the start of the new txn?  If we do
this, when the limit for the throttling is exceeded, the current txn
(even if it is a long running one) continues to do the WAL insertions,
the next txns would get blocked. Thoughts?

Do you mean while StartTransactionCommand or while assigning a new transaction id? If it is at StartTransactionCommand then we would be blocking the sessions which are only performing read queries right? 

Definitely not at StartTransactionCommand, but possibly while assigning the transaction ID in AssignTransactionId. Blocking readers is never the intent.
 
If we are doing at the transaction assignment level then we might be holding some of the locks so this might not be any better than throttling inside the commit.

If we define RPO as "no transaction can commit when the wal_distance is more than the configured MB", we have to throttle the writes before committing the transaction, and new WAL generation by new or active connections doesn't matter, as those transactions can't be committed and made visible to the user. If the RPO is defined as "no new write transactions allowed when wal_distance > configured MB", then we can block assigning new transaction IDs until the RPO policy is met. IMHO, following the sync replication semantics is easier and more explainable, as it is already familiar to the customers.
 

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Andres Freund
Date:
Hi,

On 2021-12-29 23:06:31 -0800, SATYANARAYANA NARLAPURAM wrote:
> I am afraid there are problems with making the RPO check post releasing the
> locks. By this time the transaction is committed and visible to the other
> backends (ProcArrayEndTransaction is already called) though the intention
> is to block committing transactions that violate the defined RPO.

Shrug. Anything transaction based has way bigger holes than this.


> Even though we block existing connections starting a new transaction, it is
> possible to do writes by opening a new connection / canceling the query.

If your threat model is users explicitly trying to circumvent this they can
cause problems much more easily. Trigger a bunch of vacuums, big COPYs etc.


> I am not too much worried about the lock contention as the system is already
> hosed because of the policy. This behavior is very similar to what happens
> when the Sync standby is not responding. Thoughts?

I don't see why we'd bury ourselves deeper in problems just because we already
have a problem. There's reasons why we want the delay for syncrep to be
before xact completion - but I don't see those applying to WAL throttling to a
significant degree, particularly not when it's on a transaction level.

Greetings,

Andres Freund



Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Andres Freund
Date:
Hi,

On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
> That's pretty much the same - XLogInsert() can trigger an
> XLogWrite()/Flush().
> 
> I think it's a complete no-go to add throttling to these places. It's quite
> possible that it'd cause new deadlocks, and it's almost guaranteed to have
> unintended consequences (e.g. replication falling back further because
> XLogFlush() is being throttled).

I thought of another way to implement this feature. What if we checked the
current distance somewhere within XLogInsert(), but only set
InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
check if XLogDelayPending is true and sleep the appropriate time.

That way the sleep doesn't happen with important locks held / within a
critical section, but we still delay close to where we went over the maximum
lag. And the overhead should be fairly minimal.
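
Concretely, a minimal hedged sketch of that mechanism; InterruptPending and ProcessInterrupts() are real, while XLogDelayPending and both helpers are invented:

/* In XLogInsert(): a cheap check only, no sleeping here. */
if (WalDistanceExceedsThreshold())      /* hypothetical check */
{
    InterruptPending = true;
    XLogDelayPending = true;            /* hypothetical new flag */
}

/* In ProcessInterrupts(): no critical section, no buffer locks held. */
if (XLogDelayPending)
{
    XLogDelayPending = false;
    pg_usleep(ThrottleDelayUsecs());    /* hypothetical delay policy */
}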


I'm doubtful that implementing the waits on a transactional level provides a
meaningful enough amount of control - there's just too much WAL that can be
generated within a transaction.

Greetings,

Andres Freund



Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Dilip Kumar
Date:
On Wed, Jan 5, 2022 at 11:16 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
> > That's pretty much the same - XLogInsert() can trigger an
> > XLogWrite()/Flush().
> >
> > I think it's a complete no-go to add throttling to these places. It's quite
> > possible that it'd cause new deadlocks, and it's almost guaranteed to have
> > unintended consequences (e.g. replication falling back further because
> > XLogFlush() is being throttled).
>
> I thought of another way to implement this feature. What if we checked the
> current distance somewhere within XLogInsert(), but only set
> InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
> check if XLogDelayPending is true and sleep the appropriate time.
>
> That way the sleep doesn't happen with important locks held / within a
> critical section, but we still delay close to where we went over the maximum
> lag. And the overhead should be fairly minimal.

+1, this sounds like a really good idea to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: SATYANARAYANA NARLAPURAM
Date:


On Wed, Jan 5, 2022 at 9:46 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
> That's pretty much the same - XLogInsert() can trigger an
> XLogWrite()/Flush().
>
> I think it's a complete no-go to add throttling to these places. It's quite
> possible that it'd cause new deadlocks, and it's almost guaranteed to have
> unintended consequences (e.g. replication falling back further because
> XLogFlush() is being throttled).

I thought of another way to implement this feature. What if we checked the
current distance somewhere within XLogInsert(), but only set
InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
check if XLogDelayPending is true and sleep the appropriate time.

That way the sleep doesn't happen with important locks held / within a
critical section, but we still delay close to where we went over the maximum
lag. And the overhead should be fairly minimal.

+1 to the idea, this way we can fairly throttle large and small transactions the same way. I will work on this model and share the patch. Please note that the lock contention still exists in this case.
 
I'm doubtful that implementing the waits on a transactional level provides a
meaningful enough amount of control - there's just too much WAL that can be
generated within a transaction.
 

Greetings,

Andres Freund

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Dilip Kumar
Date:
On Thu, Jan 6, 2022 at 11:27 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:

> On Wed, Jan 5, 2022 at 9:46 AM Andres Freund <andres@anarazel.de> wrote:
>>
>> Hi,
>>
>> On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
>> > That's pretty much the same - XLogInsert() can trigger an
>> > XLogWrite()/Flush().
>> >
>> > I think it's a complete no-go to add throttling to these places. It's quite
>> > possible that it'd cause new deadlocks, and it's almost guaranteed to have
>> > unintended consequences (e.g. replication falling back further because
>> > XLogFlush() is being throttled).
>>
>> I thought of another way to implement this feature. What if we checked the
>> current distance somewhere within XLogInsert(), but only set
>> InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
>> check if XLogDelayPending is true and sleep the appropriate time.
>>
>> That way the sleep doesn't happen with important locks held / within a
>> critical section, but we still delay close to where we went over the maximum
>> lag. And the overhead should be fairly minimal.
>
>
> +1 to the idea, this way we can fairly throttle large and small transactions the same way. I will work on this model and share the patch. Please note that the lock contention still exists in this case.
 

Generally, while checking for interrupts we should not be holding any
lwlock, which means we don't have the risk of holding any buffer
locks, so any other reader can continue to read from those buffers.
We will only be holding some heavyweight locks like relation/tuple
locks, but those will not impact anyone except writers trying to
update the same tuple or DDL that wants to modify the table
definition, so I don't think we have any issue here, because anyway we
don't want other writers to continue.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: SATYANARAYANA NARLAPURAM
Date:


On Wed, Jan 5, 2022 at 10:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Jan 6, 2022 at 11:27 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:

> On Wed, Jan 5, 2022 at 9:46 AM Andres Freund <andres@anarazel.de> wrote:
>>
>> Hi,
>>
>> On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
>> > That's pretty much the same - XLogInsert() can trigger an
>> > XLogWrite()/Flush().
>> >
>> > I think it's a complete no-go to add throttling to these places. It's quite
>> > possible that it'd cause new deadlocks, and it's almost guaranteed to have
>> > unintended consequences (e.g. replication falling back further because
>> > XLogFlush() is being throttled).
>>
>> I thought of another way to implement this feature. What if we checked the
>> current distance somewhere within XLogInsert(), but only set
>> InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
>> check if XLogDelayPending is true and sleep the appropriate time.
>>
>> That way the sleep doesn't happen with important locks held / within a
>> critical section, but we still delay close to where we went over the maximum
>> lag. And the overhead should be fairly minimal.
>
>
> +1 to the idea, this way we can fairly throttle large and small transactions the same way. I will work on this model and share the patch. Please note that the lock contention still exists in this case.

Generally, while checking for interrupts we should not be holding any
lwlock, which means we don't have the risk of holding any buffer
locks, so any other reader can continue to read from those buffers.
We will only be holding some heavyweight locks like relation/tuple
locks, but those will not impact anyone except writers trying to
update the same tuple or DDL that wants to modify the table
definition, so I don't think we have any issue here, because anyway we
don't want other writers to continue.

Yes, it should be OK. I was just making it explicit, following Andres's previous comment about holding heavyweight locks when waiting before the commit.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: "Bossart, Nathan"
Date:
I noticed this thread and thought I'd share my experiences building
something similar for Multi-AZ DB clusters [0].  It's not a strict RPO
mechanism, but it does throttle backends in an effort to keep the
replay lag below a configured maximum.  I can share the code if there
is interest.

I wrote it as a new extension, and except for one piece that I'll go
into later, I was able to avoid changes to core PostgreSQL code.  The
extension manages a background worker that periodically assesses the
state of the designated standbys and updates an atomic in shared
memory that indicates how long to delay.  A transaction callback
checks this value and sleeps as necessary.  Delay can be injected for
write-enabled transactions on the primary, read-only transactions on
the standbys, or both.  The extension is heavily configurable so that
it can meet the needs of a variety of workloads.
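
As a rough sketch of that structure (illustrative names only, not the actual extension code; the shared-memory allocation hooks are omitted):

    #include "postgres.h"
    #include "access/xact.h"
    #include "miscadmin.h"
    #include "port/atomics.h"

    /* Shared state: the worker publishes a delay, backends consume it. */
    typedef struct ThrottleShared
    {
        pg_atomic_uint64 delay_us;      /* current injected delay, in microseconds */
    } ThrottleShared;

    static ThrottleShared *throttle;    /* attached via shmem hooks (omitted) */

    /* Background worker side: recompute the delay from standby lag. */
    static void
    publish_delay(uint64 delay_us)
    {
        pg_atomic_write_u64(&throttle->delay_us, delay_us);
    }

    /* Backend side: sleep just before commit if a delay is in effect. */
    static void
    throttle_xact_callback(XactEvent event, void *arg)
    {
        if (event == XACT_EVENT_PRE_COMMIT)
        {
            uint64 delay = pg_atomic_read_u64(&throttle->delay_us);

            if (delay > 0)
                pg_usleep((long) delay);
        }
    }

    /* In _PG_init(): RegisterXactCallback(throttle_xact_callback, NULL); */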

One interesting challenge I encountered was accurately determining the
amount of replay lag.  The problem was twofold.  First, if there is no
activity on the primary, there will be nothing to replay on the
standbys, so the replay lag will appear to grow unbounded.  To work
around this, the extension's background worker periodically creates an
empty COMMIT record.  Second, if a standby reconnects after a long
time, the replay lag won't be accurate for some time.  Instead, the
replay lag will slowly increase until it reaches the correct value.
Since the delay calculation looks at the trend of the replay lag, this
apparent unbounded growth causes it to inject far more delay than is
necessary.  My guess is that this is related to 9ea3c64, and maybe it
is worth rethinking that logic.  For now, the extension just
periodically reports the value of GetLatestXTime() from the standbys
to the primary to get an accurate reading.  This is done via a new
replication callback mechanism (which requires core PostgreSQL
changes).  I can share this patch along with the extension, as I bet
there are other applications for it.
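
The heartbeat half of that can be shown in a few lines; this is a sketch of what a background worker with a database connection might do, not the actual extension code:

    /*
     * Emit an empty but WAL-logged commit so replay lag stays measurable.
     * Forcing an xid assignment guarantees a commit record even though the
     * transaction changes nothing.
     */
    StartTransactionCommand();
    (void) GetCurrentTransactionId();
    CommitTransactionCommand();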

I should also note that the extension only considers "active" standbys
and primaries.  That is, ones with an active WAL sender or WAL
receiver.  This avoids the need to guess what should be done during a
network partition, but it also means that we must gracefully handle
standbys reconnecting with massive amounts of lag.  The extension is
designed to slowly ramp up the amount of injected delay until the
standby's apply lag is trending down at a sufficient rate.

I see that an approach was suggested upthread for throttling based on
WAL distance instead of per-transaction.  While the transaction
approach works decently well for certain workloads (e.g., many small
transactions like those from pgbench), it might require further tuning
for very large transactions or workloads with a variety of transaction
sizes.  For that reason, I would definitely support building a way to
throttle based on WAL generation.  It might be a good idea to avoid
throttling critical activity such as anti-wraparound vacuuming, too.
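
The exemption itself would be cheap to test; a sketch (MyProc->statusFlags and PROC_VACUUM_FOR_WRAPAROUND exist as of PostgreSQL 14; where to call this is the open question):

    #include "storage/proc.h"

    /* Never throttle a backend performing anti-wraparound vacuum. */
    static bool
    throttle_exempt(void)
    {
        return MyProc != NULL &&
            (MyProc->statusFlags & PROC_VACUUM_FOR_WRAPAROUND) != 0;
    }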

Nathan

[0] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/multi-az-db-clusters-concepts.html


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From
Konstantin Knizhnik
Date:

On 11.01.2022 03:06, Bossart, Nathan wrote:
> I noticed this thread and thought I'd share my experiences building
> something similar for Multi-AZ DB clusters [0].  It's not a strict RPO
> mechanism, but it does throttle backends in an effort to keep the
> replay lag below a configured maximum.  I can share the code if there
> is interest.
>
> I wrote it as a new extension, and except for one piece that I'll go
> into later, I was able to avoid changes to core PostgreSQL code.  The
> extension manages a background worker that periodically assesses the
> state of the designated standbys and updates an atomic in shared
> memory that indicates how long to delay.  A transaction callback
> checks this value and sleeps as necessary.  Delay can be injected for
> write-enabled transactions on the primary, read-only transactions on
> the standbys, or both.  The extension is heavily configurable so that
> it can meet the needs of a variety of workloads.
>
> One interesting challenge I encountered was accurately determining the
> amount of replay lag.  The problem was twofold.  First, if there is no
> activity on the primary, there will be nothing to replay on the
> standbys, so the replay lag will appear to grow unbounded.  To work
> around this, the extension's background worker periodically creates an
> empty COMMIT record.  Second, if a standby reconnects after a long
> time, the replay lag won't be accurate for some time.  Instead, the
> replay lag will slowly increase until it reaches the correct value.
> Since the delay calculation looks at the trend of the replay lag, this
> apparent unbounded growth causes it to inject far more delay than is
> necessary.  My guess is that this is related to 9ea3c64, and maybe it
> is worth rethinking that logic.  For now, the extension just
> periodically reports the value of GetLatestXTime() from the standbys
> to the primary to get an accurate reading.  This is done via a new
> replication callback mechanism (which requires core PostgreSQL
> changes).  I can share this patch along with the extension, as I bet
> there are other applications for it.
>
> I should also note that the extension only considers "active" standbys
> and primaries.  That is, ones with an active WAL sender or WAL
> receiver.  This avoids the need to guess what should be done during a
> network partition, but it also means that we must gracefully handle
> standbys reconnecting with massive amounts of lag.  The extension is
> designed to slowly ramp up the amount of injected delay until the
> standby's apply lag is trending down at a sufficient rate.
>
> I see that an approach was suggested upthread for throttling based on
> WAL distance instead of per-transaction.  While the transaction
> approach works decently well for certain workloads (e.g., many small
> transactions like those from pgbench), it might require further tuning
> for very large transactions or workloads with a variety of transaction
> sizes.  For that reason, I would definitely support building a way to
> throttle based on WAL generation.  It might be a good idea to avoid
> throttling critical activity such as anti-wraparound vacuuming, too.
>
> Nathan
>
> [0] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/multi-az-db-clusters-concepts.html
>

We have faced a similar problem in Zenith (open-source Aurora) and had
to implement a back-pressure mechanism to prevent overflow of WAL at
stateless compute nodes and too-long delays of page reconstruction. Our
implementation is the following:
1. Three GUCs are added: max_replication_write/flush/apply_lag.
2. Replication lags are checked in XLogInsert(), and if one of the three
thresholds is exceeded, InterruptPending is set.
3. In ProcessInterrupts() we block backend execution until the lag is
within the specified boundary:

     #define BACK_PRESSURE_DELAY 10000L // 0.01 sec
     uint64 lag;

     while (true)
     {
         ProcessInterrupts_pg();

         // Suspend writers until replicas catch up
         lag = backpressure_lag();
         if (lag <= 0)
             break;

         set_ps_display("backpressure throttling");

         elog(DEBUG2, "backpressure throttling: lag %lu", lag);
         pg_usleep(BACK_PRESSURE_DELAY);
     }

What is wrong here is that a backend can be blocked for a long time
(causing the client application to fail due to timeout expiration) and
holds its acquired locks while sleeping.
We are thinking about a smarter way of choosing the throttling delay
(for example, an exponential increase of the throttling sleep interval
until some maximal value is reached). But it is really hard to find a
universal scheme that works well for all use cases (for example,
short-lived sessions whose clients connect to the server to execute
just one query).
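
For what it's worth, the exponential variant only changes the sleep policy of the loop above; a sketch, with the same hypothetical backpressure_lag():

    long        delay = 1000L;          /* start at 1 ms */
    const long  max_delay = 1000000L;   /* cap at 1 s */

    /* Back off exponentially instead of sleeping a fixed interval. */
    while (backpressure_lag() > 0)
    {
        ProcessInterrupts_pg();
        set_ps_display("backpressure throttling");
        pg_usleep(delay);
        delay = Min(delay * 2, max_delay);
    }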

Concerning throttling at the end of a transaction, which eliminates the
problem of holding locks and does not require changes in the postgres
core: unfortunately it doesn't address the problem of large transactions
(for example, bulk loading of data using COPY). In that case just one
transaction can cause an arbitrarily large lag.

I am not sure how critical the problem of holding locks during
throttling is: yes, it may block other database activity, including
vacuum and the execution of read-only queries.
But it should not block the walsender and so cause a deadlock. And in
most cases read-only transactions do not conflict with write
transactions, so suspending a write transaction should not block
readers.

Another problem with throttling is large WAL records (for example, a
custom logical replication WAL record can be arbitrarily large). If such
a record is larger than the replication lag limit, then it can cause a
deadlock.



Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From
Bharath Rupireddy
Date:
On Tue, Jan 11, 2022 at 2:11 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
>
> We have faced a similar problem in Zenith (open-source Aurora) and had
> to implement a back-pressure mechanism to prevent overflow of WAL at
> stateless compute nodes and too-long delays of page reconstruction. Our
> implementation is the following:
> 1. Three GUCs are added: max_replication_write/flush/apply_lag.
> 2. Replication lags are checked in XLogInsert(), and if one of the three
> thresholds is exceeded, InterruptPending is set.
> 3. In ProcessInterrupts() we block backend execution until the lag is
> within the specified boundary:
>
>      #define BACK_PRESSURE_DELAY 10000L // 0.01 sec
>      uint64 lag;
>
>      while (true)
>      {
>          ProcessInterrupts_pg();
>
>          // Suspend writers until replicas catch up
>          lag = backpressure_lag();
>          if (lag <= 0)
>              break;
>
>          set_ps_display("backpressure throttling");
>
>          elog(DEBUG2, "backpressure throttling: lag %lu", lag);
>          pg_usleep(BACK_PRESSURE_DELAY);
>      }
>
> What is wrong here is that a backend can be blocked for a long time
> (causing the client application to fail due to timeout expiration) and
> holds its acquired locks while sleeping.

Do we ever call CHECK_FOR_INTERRUPTS() while holding "important"
locks? I haven't seen any asserts or anything of that sort in
ProcessInterrupts(), though; it looks like it's the caller's
responsibility not to process interrupts while holding heavyweight
locks. There are some points on this upthread [1].

I don't think the various postgres timeouts (statement_timeout,
lock_timeout, idle_in_transaction_session_timeout,
idle_session_timeout, client_connection_check_interval) are a problem
while we wait in ProcessInterrupts() for the replication lag to get
better. I think SIGALRM can be raised while we wait, but it can't be
handled there. Why can't we just disable these timeouts before going
to wait and re-enable them right after the replication lag gets better?
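
If we went that way, the existing timeout.c API looks sufficient, roughly as below; wait_for_replication_lag() is the hypothetical throttle wait, and real code would have to account for the time the statement has already consumed:

    #include "storage/proc.h"           /* StatementTimeout, LockTimeout */
    #include "utils/timeout.h"

    /* Suspend timeouts that would otherwise fire during the throttle wait. */
    disable_timeout(STATEMENT_TIMEOUT, false);
    disable_timeout(LOCK_TIMEOUT, false);

    wait_for_replication_lag();         /* hypothetical */

    /* Re-arm them afterwards, if configured. */
    if (StatementTimeout > 0)
        enable_timeout_after(STATEMENT_TIMEOUT, StatementTimeout);
    if (LockTimeout > 0)
        enable_timeout_after(LOCK_TIMEOUT, LockTimeout);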

And clients can always have their own no-reply-kills-the-transaction
sort of timeout; if so, let them fail and deal with it. I don't think
we can do much about this.

> We are thinking about smarter way of choosing throttling delay (for
> example exponential increase of throttling sleep interval until some
> maximal value is reached).
> But it is really hard to find some universal schema which will be good
> for all use cases (for example in case of short living session, which
> clients are connected to the server to execute just one query).

I think there has to be an upper limit on the wait, perhaps a
'preconfigured amount of time'. I know others upthread aren't happy
with failing transactions because of replication lag. But my point is:
how long would we let the backends wait or throttle WAL writes? It
mustn't be forever (say, if a broken connection to the async standby is
found).

[1] https://www.postgresql.org/message-id/20220105174643.lozdd3radxv4tlmx%40alap3.anarazel.de

Regards,
Bharath Rupireddy.