Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
Bharath Rupireddy
Date:
On Thu, Dec 23, 2021 at 5:53 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
>
> Hi Hackers,
>
> I am considering implementing an RPO (recovery point objective) enforcement feature for Postgres where the WAL writes on the primary are stalled when the WAL distance between the primary and standby exceeds the configured (replica_lag_in_bytes) threshold. This feature is useful particularly in disaster recovery setups where primary and standby are in different regions and synchronous replication can't be set up for latency and performance reasons, yet some level of RPO enforcement is required.

+1 for the idea in general. However, blocking writes on the primary seems an extremely radical idea. The replicas can fall behind transiently at times, and blocking writes on the primary may make applications fail during these transient periods. This is not a problem if the applications have retry logic for the writes. How about blocking writes on the primary only if the replicas fall behind it for a certain period of time?

> The idea here is to calculate the lag between the primary and the standby (Async?) server during XLogInsert and block the caller until the lag is less than the threshold value. We can calculate the max lag by iterating over ReplicationSlotCtl->replication_slots.

The "falling behind" can also be quantified by the number of write-transactions on the primary. I think it's good to have the users choose what "falling behind" means for them. We could have something like the "recovery_target" param with different options: name, xid, time, lsn.

> If this is not something we want to do in the core, at least adding a hook for XLogInsert is of great value.

IMHO, this feature may not be needed by everyone; the hook-way seems reasonable so that postgres vendors can provide different implementations (for instance, they can write an extension that implements this hook, which can block writes on the primary, write some log messages, inform some service layer of the replicas falling behind the primary, etc.). Since XLogInsert is a hot path that gets called very frequently, the hook really should do as little work as possible, otherwise write latency may increase.

> A few other scenarios I can think of with the hook are:
>
> Enforcing RPO as described above
> Enforcing rate limit and slow throttling when sync standby is falling behind (could be flush lag or replay lag)
> Transactional log rate governance - useful for cloud providers to provide SKU sizes based on allowed WAL writes.
>
> Thoughts?

The hook can help to achieve the above objectives, but where to place it and what parameters it should take as input (or what info it should emit out of the server via the hook) are important too.

Having said all that, the RPO feature can also be implemented outside of postgres. A simple implementation could be: get the primary's current WAL LSN using pg_current_wal_lsn() and every replica's restart_lsn from pg_replication_slots; if they differ by a certain amount, issue the ALTER SYSTEM SET READ ONLY command [1] on the primary. This requires connections to the server and proper access rights. The feature can also be implemented as an extension (without the hook), which doesn't require any connections to the server yet can access the required info (the primary's current WAL LSN, the restart_lsn of the replication slots, etc.), but then the RPO enforcement may not be immediate, as the server doesn't have any hooks in XLogInsert or elsewhere.

[1] - https://www.postgresql.org/message-id/CAAJ_b967uKBiW6gbHr5aPzweURYjEGv333FHVHxvJmMhanwHXA%40mail.gmail.com

Regards,
Bharath Rupireddy.
Fwd: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
SATYANARAYANA NARLAPURAM
Date:
Please find the attached draft patch.
On Thu, Dec 23, 2021 at 2:47 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Thu, Dec 23, 2021 at 5:53 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
>
> Hi Hackers,
>
> I am considering implementing RPO (recovery point objective) enforcement feature for Postgres where the WAL writes on the primary are stalled when the WAL distance between the primary and standby exceeds the configured (replica_lag_in_bytes) threshold. This feature is useful particularly in the disaster recovery setups where primary and standby are in different regions and synchronous replication can't be set up for latency and performance reasons yet requires some level of RPO enforcement.
+1 for the idea in general. However, blocking writes on the primary seems
an extremely radical idea. The replicas can fall behind transiently at
times, and blocking writes on the primary may cause applications to fail
during these transient periods. This is not a problem if the applications
have retry logic for the writes. How about blocking writes on the primary
only if the replicas fall behind it for a certain period of time?
My proposal is to block the caller from writing until the lag situation improves. I don't want to throw any errors and fail the transaction. I think we are aligned?
> The idea here is to calculate the lag between the primary and the standby (Async?) server during XLogInsert and block the caller until the lag is less than the threshold value. We can calculate the max lag by iterating over ReplicationSlotCtl->replication_slots.
The "falling behind" can also be quantified by the number of
write-transactions on the primary. I think it's good to have the users
choose what the "falling behind" means for them. We can have something
like the "recovery_target" param with different options name, xid,
time, lsn.
Transactions can be of arbitrary size and duration, so these options may not provide the desired results. Time is a worthy option to add.
> If this is not something we want to do in the core, at least adding a hook for XLogInsert is of great value.
IMHO, this feature may not be needed by everyone, the hook-way seems
reasonable so that the postgres vendors can provide different
implementations (for instance they can write an extension that
implements this hook which can block writes on primary, write some log
messages, inform some service layer of the replicas falling behind the
primary etc.). Since XLogInsert is a hot path that gets called very
frequently, the hook really should do as little work as possible,
otherwise write latency may increase.
A hook is a good start. If there is enough interest, an extension can then be added to the contrib module.
> A few other scenarios I can think of with the hook are:
>
> Enforcing RPO as described above
> Enforcing rate limit and slow throttling when sync standby is falling behind (could be flush lag or replay lag)
> Transactional log rate governance - useful for cloud providers to provide SKU sizes based on allowed WAL writes.
>
> Thoughts?
The hook can help to achieve the above objectives but where to place
it and what parameters it should take as input (or what info it should
emit out of the server via the hook) are important too.
XLogInsert, in my opinion, is the best place to call it, and the hook can have a signature like "void xlog_insert_hook(void)", since all the throttling logic requires is the current flush position, which can be obtained from GetFlushRecPtr, and the ReplicationSlotCtl. Attached a draft patch.
Having said all, the RPO feature can also be implemented outside of
the postgres, a simple implementation could be - get the primary
current wal lsn using pg_current_wal_lsn and all the replicas
restart_lsn using pg_replication_slot, if they differ by certain
amount, then issue ALTER SYSTEM SET READ ONLY command [1] on the
primary, this requires the connections to the server and proper access
rights. This feature can also be implemented as an extension (without
the hook) which doesn't require any connections to the server yet can
access the required info primary current_wal_lsn, restart_lsn of the
replication slots etc, but the RPO enforcement may not be immediate as
the server doesn't have any hooks in XLogInsert or some other area.
READ ONLY is a decent choice, but it can fail the writes, or not take effect until the end of the transaction?
[1] - https://www.postgresql.org/message-id/CAAJ_b967uKBiW6gbHr5aPzweURYjEGv333FHVHxvJmMhanwHXA%40mail.gmail.com
Regards,
Bharath Rupireddy.
Attachments
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
XLogInsert in my opinion is the best place to call it and the hook can be something like this "void xlog_insert_hook(NULL)" as all the throttling logic required is the current flush position which can be obtained from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch.
IMHO, it is not a good idea to call an external hook function inside a critical section. Generally, we ensure that we do not call any code path within a critical section which can throw an error and if we start calling the external hook then we lose that control. It should be blocked at the operation level itself e.g. ALTER TABLE READ ONLY, or by some other hook at a little higher level.
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
Bharath Rupireddy
Date:
On Fri, Dec 24, 2021 at 4:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
>>
>> XLogInsert in my opinion is the best place to call it and the hook can be something like this "void xlog_insert_hook(NULL)" as all the throttling logic required is the current flush position which can be obtained from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch.
>
> IMHO, it is not a good idea to call an external hook function inside a critical section. Generally, we ensure that we do not call any code path within a critical section which can throw an error, and if we start calling the external hook then we lose that control. It should be blocked at the operation level itself e.g. ALTER TABLE READ ONLY, or by some other hook at a little higher level.

Yeah, good point. It's not advisable to give control to an external module in the critical section. For instance, memory allocation isn't allowed (see [1]), and an ereport(ERROR, ...) would be transformed to PANIC inside the critical section (see [2], [3]). Moreover, a critical section is meant to be short-spanned, i.e. executing as little code as possible. There's no guarantee that an external module would follow these rules.

I suggest we do it at the level of transaction start, i.e. when a txnid is getting allocated, in AssignTransactionId(). If we do this, when the limit for the throttling is exceeded, the current txn (even if it is a long running txn) continues to do the WAL insertions, but the next txns would get blocked. This is okay and can be conveyed to the users via documentation if need be. We already block txnid assignments for parallel workers in this function, so this is a good choice IMO. Thoughts?

[1]
/*
 * You should not do memory allocations within a critical section, because
 * an out-of-memory error will be escalated to a PANIC. To enforce that
 * rule, the allocation functions Assert that.
 */
#define AssertNotInCriticalSection(context) \
	Assert(CritSectionCount == 0 || (context)->allowInCritSection)

[2]
/*
 * If we are inside a critical section, all errors become PANIC
 * errors. See miscadmin.h.
 */
if (CritSectionCount > 0)
	elevel = PANIC;

[3]
 * A related, but conceptually distinct, mechanism is the "critical section"
 * mechanism. A critical section not only holds off cancel/die interrupts,
 * but causes any ereport(ERROR) or ereport(FATAL) to become ereport(PANIC)
 * --- that is, a system-wide reset is forced. Needless to say, only really
 * *critical* code should be marked as a critical section! Currently, this
 * mechanism is only used for XLOG-related code.

Regards,
Bharath Rupireddy.
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
SATYANARAYANA NARLAPURAM
Date:
On Fri, Dec 24, 2021 at 3:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:

XLogInsert in my opinion is the best place to call it and the hook can be something like this "void xlog_insert_hook(NULL)" as all the throttling logic required is the current flush position which can be obtained from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch.

IMHO, it is not a good idea to call an external hook function inside a critical section. Generally, we ensure that we do not call any code path within a critical section which can throw an error and if we start calling the external hook then we lose that control.
Thank you for the comment. XLogInsertRecord is inside a critical section but not XLogInsert. Am I missing something?
It should be blocked at the operation level itself e.g. ALTER TABLE READ ONLY, or by some other hook at a little higher level.
There is a lot of maintenance overhead with a custom implementation at the individual database and table level. This doesn't provide the necessary control that I am looking for.
On Sun, Dec 26, 2021 at 3:52 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
On Fri, Dec 24, 2021 at 3:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:

XLogInsert in my opinion is the best place to call it and the hook can be something like this "void xlog_insert_hook(NULL)" as all the throttling logic required is the current flush position which can be obtained from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch.

IMHO, it is not a good idea to call an external hook function inside a critical section. Generally, we ensure that we do not call any code path within a critical section which can throw an error and if we start calling the external hook then we lose that control.

Thank you for the comment. XLogInsertRecord is inside a critical section but not XLogInsert. Am I missing something?
Actually, all the WAL insertions are done under a critical section (with a few exceptions); that means if you look at all the references of XLogInsert(), it is always called under a critical section, and that is my main worry about hooking at the XLogInsert level.
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
SATYANARAYANA NARLAPURAM
Date:
On Sat, Dec 25, 2021 at 6:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sun, Dec 26, 2021 at 3:52 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:

On Fri, Dec 24, 2021 at 3:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:

XLogInsert in my opinion is the best place to call it and the hook can be something like this "void xlog_insert_hook(NULL)" as all the throttling logic required is the current flush position which can be obtained from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch.

IMHO, it is not a good idea to call an external hook function inside a critical section. Generally, we ensure that we do not call any code path within a critical section which can throw an error and if we start calling the external hook then we lose that control.

Thank you for the comment. XLogInsertRecord is inside a critical section but not XLogInsert. Am I missing something?

Actually all the WAL insertions are done under a critical section (except few exceptions), that means if you see all the references of XLogInsert(), it is always called under the critical section and that is my main worry about hooking at XLogInsert level.
Got it, understood the concern. But can we document the limitations of the hook and let the hook take care of it? I don't expect an error to be thrown here since we are not planning to allocate memory or make file system calls but instead look at the shared memory state and add delays when required.
On Sun, Dec 26, 2021 at 1:06 PM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
>
> Got it, understood the concern. But can we document the limitations of the hook and let the hook take care of it? I don't expect an error to be thrown here since we are not planning to allocate memory or make file system calls but instead look at the shared memory state and add delays when required.

It wouldn't work. You can't make any assumption about how long it would take for the replication lag to resolve, so you may have to wait for a very long time. That means that, at the very least, the sleep has to be interruptible, and therefore it can raise an error. In general there isn't much you can do in a critical section, so this approach doesn't seem sensible to me.
On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
Actually all the WAL insertions are done under a critical section (except few exceptions), that means if you see all the references of XLogInsert(), it is always called under the critical section and that is my main worry about hooking at XLogInsert level.

Got it, understood the concern. But can we document the limitations of the hook and let the hook take care of it? I don't expect an error to be thrown here since we are not planning to allocate memory or make file system calls but instead look at the shared memory state and add delays when required.
Yet another problem is that if we are in XLogInsert(), that means we are holding the buffer locks on all the pages we have modified, so if we add a hook at that level which can make it wait, then we would also block any of the read operations that need to read from those buffers. I haven't thought about what a better way to do this could be, but this is certainly not good.
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
SATYANARAYANA NARLAPURAM
Date:
On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:

Actually all the WAL insertions are done under a critical section (except few exceptions), that means if you see all the references of XLogInsert(), it is always called under the critical section and that is my main worry about hooking at XLogInsert level.

Got it, understood the concern. But can we document the limitations of the hook and let the hook take care of it? I don't expect an error to be thrown here since we are not planning to allocate memory or make file system calls but instead look at the shared memory state and add delays when required.

Yet another problem is that if we are in XlogInsert() that means we are holding the buffer locks on all the pages we have modified, so if we add a hook at that level which can make it wait then we would also block any of the read operations needed to read from those buffers. I haven't thought what could be better way to do this but this is certainly not good.
Yes, this is a problem. The other approach is adding a hook at XLogWrite/XLogFlush? All the other backends will be waiting behind the WALWriteLock. The process that is performing the write enters into a busy loop with small delays until the criteria are met. Inability to process the interrupts inside the critical section is a challenge in both approaches. Any other thoughts?
Greetings,

* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
> On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
> >>> Actually all the WAL insertions are done under a critical section
> >>> (except few exceptions), that means if you see all the references of
> >>> XLogInsert(), it is always called under the critical section and that is my
> >>> main worry about hooking at XLogInsert level.
> >>
> >> Got it, understood the concern. But can we document the limitations of
> >> the hook and let the hook take care of it? I don't expect an error to be
> >> thrown here since we are not planning to allocate memory or make file
> >> system calls but instead look at the shared memory state and add delays
> >> when required.
> >
> > Yet another problem is that if we are in XlogInsert() that means we are
> > holding the buffer locks on all the pages we have modified, so if we add a
> > hook at that level which can make it wait then we would also block any of
> > the read operations needed to read from those buffers. I haven't thought
> > what could be better way to do this but this is certainly not good.
>
> Yes, this is a problem. The other approach is adding a hook at
> XLogWrite/XLogFlush? All the other backends will be waiting behind the
> WALWriteLock. The process that is performing the write enters into a busy
> loop with small delays until the criteria are met. Inability to process the
> interrupts inside the critical section is a challenge in both approaches.
> Any other thoughts?

Why not have this work the exact same way sync replicas do, except that it's based off of some byte/time lag for some set of async replicas? That is, in RecordTransactionCommit(), perhaps right after the SyncRepWaitForLSN() call, or maybe even add this to that function? Sure seems like there's a lot of similarity.

Thanks,

Stephen
Attachments
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
SATYANARAYANA NARLAPURAM
Date:
Stephen, thank you!
On Wed, Dec 29, 2021 at 5:46 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
> On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
> > satyanarlapuram@gmail.com> wrote:
> >>> Actually all the WAL insertions are done under a critical section
> >>> (except few exceptions), that means if you see all the references of
> >>> XLogInsert(), it is always called under the critical section and that is my
> >>> main worry about hooking at XLogInsert level.
> >>>
> >>
> >> Got it, understood the concern. But can we document the limitations of
> >> the hook and let the hook take care of it? I don't expect an error to be
> >> thrown here since we are not planning to allocate memory or make file
> >> system calls but instead look at the shared memory state and add delays
> >> when required.
> >>
> >>
> > Yet another problem is that if we are in XlogInsert() that means we are
> > holding the buffer locks on all the pages we have modified, so if we add a
> > hook at that level which can make it wait then we would also block any of
> > the read operations needed to read from those buffers. I haven't thought
> > what could be better way to do this but this is certainly not good.
> >
>
> Yes, this is a problem. The other approach is adding a hook at
> XLogWrite/XLogFlush? All the other backends will be waiting behind the
> WALWriteLock. The process that is performing the write enters into a busy
> loop with small delays until the criteria are met. Inability to process the
> interrupts inside the critical section is a challenge in both approaches.
> Any other thoughts?
Why not have this work the exact same way sync replicas do, except that
it's based off of some byte/time lag for some set of async replicas?
That is, in RecordTransactionCommit(), perhaps right after the
SyncRepWaitForLSN() call, or maybe even add this to that function? Sure
seems like there's a lot of similarity.
I was thinking of achieving log governance (throttling WAL MB/sec) and also providing RPO guarantees. In this model, it is hard to throttle the WAL generation of a long-running transaction (for example, COPY or SELECT INTO). However, this meets my RPO needs. Are you in support of adding a hook, or of the actual change? IMHO, the hook allows more creative options. I can go ahead and make a patch accordingly.
Thanks,
Stephen
Greetings,
On Wed, Dec 29, 2021 at 14:04 SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
Stephen, thank you!

On Wed, Dec 29, 2021 at 5:46 AM Stephen Frost <sfrost@snowman.net> wrote:

Greetings,
* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
> On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
> > satyanarlapuram@gmail.com> wrote:
> >>> Actually all the WAL insertions are done under a critical section
> >>> (except few exceptions), that means if you see all the references of
> >>> XLogInsert(), it is always called under the critical section and that is my
> >>> main worry about hooking at XLogInsert level.
> >>>
> >>
> >> Got it, understood the concern. But can we document the limitations of
> >> the hook and let the hook take care of it? I don't expect an error to be
> >> thrown here since we are not planning to allocate memory or make file
> >> system calls but instead look at the shared memory state and add delays
> >> when required.
> >>
> >>
> > Yet another problem is that if we are in XlogInsert() that means we are
> > holding the buffer locks on all the pages we have modified, so if we add a
> > hook at that level which can make it wait then we would also block any of
> > the read operations needed to read from those buffers. I haven't thought
> > what could be better way to do this but this is certainly not good.
> >
>
> Yes, this is a problem. The other approach is adding a hook at
> XLogWrite/XLogFlush? All the other backends will be waiting behind the
> WALWriteLock. The process that is performing the write enters into a busy
> loop with small delays until the criteria are met. Inability to process the
> interrupts inside the critical section is a challenge in both approaches.
> Any other thoughts?
Why not have this work the exact same way sync replicas do, except that
it's based off of some byte/time lag for some set of async replicas?
That is, in RecordTransactionCommit(), perhaps right after the
SyncRepWaitForLSN() call, or maybe even add this to that function? Sure
seems like there's a lot of similarity.

I was thinking of achieving log governance (throttling WAL MB/sec) and also providing RPO guarantees. In this model, it is hard to throttle WAL generation of a long running transaction (for example copy/select into).
Long running transactions have a lot of downsides and are best discouraged. I don’t know that we should be designing this for that case specifically, particularly given the complications it would introduce as discussed on this thread already.
However, this meets my RPO needs. Are you in support of adding a hook or the actual change? IMHO, the hook allows more creative options. I can go ahead and make a patch accordingly.
I would think this would make more sense as part of core rather than a hook, as that then requires an extension and additional setup to get going, which raises the bar quite a bit when it comes to actually being used.
Thanks,
Stephen
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
SATYANARAYANA NARLAPURAM
Date:
On Wed, Dec 29, 2021 at 11:16 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

On Wed, Dec 29, 2021 at 14:04 SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:

Stephen, thank you!

On Wed, Dec 29, 2021 at 5:46 AM Stephen Frost <sfrost@snowman.net> wrote:

Greetings,
* SATYANARAYANA NARLAPURAM (satyanarlapuram@gmail.com) wrote:
> On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
> > satyanarlapuram@gmail.com> wrote:
> >>> Actually all the WAL insertions are done under a critical section
> >>> (except few exceptions), that means if you see all the references of
> >>> XLogInsert(), it is always called under the critical section and that is my
> >>> main worry about hooking at XLogInsert level.
> >>>
> >>
> >> Got it, understood the concern. But can we document the limitations of
> >> the hook and let the hook take care of it? I don't expect an error to be
> >> thrown here since we are not planning to allocate memory or make file
> >> system calls but instead look at the shared memory state and add delays
> >> when required.
> >>
> >>
> > Yet another problem is that if we are in XlogInsert() that means we are
> > holding the buffer locks on all the pages we have modified, so if we add a
> > hook at that level which can make it wait then we would also block any of
> > the read operations needed to read from those buffers. I haven't thought
> > what could be better way to do this but this is certainly not good.
> >
>
> Yes, this is a problem. The other approach is adding a hook at
> XLogWrite/XLogFlush? All the other backends will be waiting behind the
> WALWriteLock. The process that is performing the write enters into a busy
> loop with small delays until the criteria are met. Inability to process the
> interrupts inside the critical section is a challenge in both approaches.
> Any other thoughts?
Why not have this work the exact same way sync replicas do, except that
it's based off of some byte/time lag for some set of async replicas?
That is, in RecordTransactionCommit(), perhaps right after the
SyncRepWaitForLSN() call, or maybe even add this to that function? Sure
seems like there's a lot of similarity.

I was thinking of achieving log governance (throttling WAL MB/sec) and also providing RPO guarantees. In this model, it is hard to throttle WAL generation of a long running transaction (for example copy/select into).

Long running transactions have a lot of downsides and are best discouraged. I don't know that we should be designing this for that case specifically, particularly given the complications it would introduce as discussed on this thread already.

However, this meets my RPO needs. Are you in support of adding a hook or the actual change? IMHO, the hook allows more creative options. I can go ahead and make a patch accordingly.

I would think this would make more sense as part of core rather than a hook, as that then requires an extension and additional setup to get going, which raises the bar quite a bit when it comes to actually being used.
Sounds good, I will work on making the changes accordingly.
Thanks,

Stephen
Hi,

On 2021-12-27 16:40:28 -0800, SATYANARAYANA NARLAPURAM wrote:
> > Yet another problem is that if we are in XlogInsert() that means we are
> > holding the buffer locks on all the pages we have modified, so if we add a
> > hook at that level which can make it wait then we would also block any of
> > the read operations needed to read from those buffers. I haven't thought
> > what could be better way to do this but this is certainly not good.
>
> Yes, this is a problem. The other approach is adding a hook at
> XLogWrite/XLogFlush?

That's pretty much the same - XLogInsert() can trigger an XLogWrite()/Flush().

I think it's a complete no-go to add throttling to these places. It's quite possible that it'd cause new deadlocks, and it's almost guaranteed to have unintended consequences (e.g. replication falling back further because XLogFlush() is being throttled).

I also don't think it's a sane thing to add hooks to these places. It's complicated enough as-is, adding the chance for random other things to happen during such crucial operations will make it even harder to maintain.

Greetings,

Andres Freund
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
SATYANARAYANA NARLAPURAM
Date:
On Wed, Dec 29, 2021 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-27 16:40:28 -0800, SATYANARAYANA NARLAPURAM wrote:
> > Yet another problem is that if we are in XlogInsert() that means we are
> > holding the buffer locks on all the pages we have modified, so if we add a
> > hook at that level which can make it wait then we would also block any of
> > the read operations needed to read from those buffers. I haven't thought
> > what could be better way to do this but this is certainly not good.
> >
>
> Yes, this is a problem. The other approach is adding a hook at
> XLogWrite/XLogFlush?
That's pretty much the same - XLogInsert() can trigger an
XLogWrite()/Flush().
I think it's a complete no-go to add throttling to these places. It's quite
possible that it'd cause new deadlocks, and it's almost guaranteed to have
unintended consequences (e.g. replication falling back further because
XLogFlush() is being throttled).
I also don't think it's a sane thing to add hooks to these places. It's
complicated enough as-is, adding the chance for random other things to happen
during such crucial operations will make it even harder to maintain.
Andres, thanks for the comments. Agreed on this based on the previous discussions on this thread. Could you please share your thoughts on adding it after SyncRepWaitForLSN()?
Greetings,
Andres Freund
Hi,

On 2021-12-29 11:34:53 -0800, SATYANARAYANA NARLAPURAM wrote:
> On Wed, Dec 29, 2021 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
> Andres, thanks for the comments. Agreed on this based on the previous
> discussions on this thread. Could you please share your thoughts on adding
> it after SyncRepWaitForLSN()?

I don't think that's good either - you're delaying transaction commit (i.e. xact becoming visible / locks being released). That also has the danger of increasing lock contention (albeit more likely to be heavyweight locks / serializable state). It'd have to be after the transaction actually committed.

Greetings,

Andres Freund
On Thu, Dec 30, 2021 at 1:09 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-29 11:34:53 -0800, SATYANARAYANA NARLAPURAM wrote:
> On Wed, Dec 29, 2021 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
> Andres, thanks for the comments. Agreed on this based on the previous
> discussions on this thread. Could you please share your thoughts on adding
> it after SyncRepWaitForLSN()?
I don't think that's good either - you're delaying transaction commit
(i.e. xact becoming visible / locks being released).
Agree with that.
That also has the danger
of increasing lock contention (albeit more likely to be heavyweight locks /
serializable state). It'd have to be after the transaction actually committed.
Yeah, I think that would make sense, even though we will be allowing a new backend to get connected, insert WAL, and get committed; only after that will it be throttled. However, if the number of max connections is very high, then even after we detect a lag, a significant amount of WAL could be generated, even leaving long-running transactions aside. But I think it will still serve the purpose of what Satya is trying to achieve.
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
SATYANARAYANA NARLAPURAM
Date:
On Wed, Dec 29, 2021 at 10:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Dec 30, 2021 at 1:09 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-29 11:34:53 -0800, SATYANARAYANA NARLAPURAM wrote:
> On Wed, Dec 29, 2021 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
> Andres, thanks for the comments. Agreed on this based on the previous
> discussions on this thread. Could you please share your thoughts on adding
> it after SyncRepWaitForLSN()?
I don't think that's good either - you're delaying transaction commit
(i.e. xact becoming visible / locks being released).

Agree with that.

That also has the danger
of increasing lock contention (albeit more likely to be heavyweight locks /
serializable state). It'd have to be after the transaction actually committed.

Yeah, I think that would make sense, even though we will be allowing a new backend to get connected insert WAL, and get committed but after that, it will be throttled. However, if the number of max connections will be very high then even after we detected a lag there a significant amount WAL could be generated, even if we keep long-running transactions aside. But I think still it will serve the purpose of what Satya is trying to achieve.
I am afraid there are problems with making the RPO check after releasing the locks. By this time the transaction is committed and visible to the other backends (ProcArrayEndTransaction is already called), though the intention is to block committing transactions that violate the defined RPO. Even if we block existing connections from starting a new transaction, it is possible to do writes by opening a new connection / canceling the query. I am not too worried about the lock contention, as the system is already hosed because of the policy. This behavior is very similar to what happens when the sync standby is not responding. Thoughts?
On Thu, Dec 30, 2021 at 12:36 PM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
Yeah, I think that would make sense, even though we will be allowing a new backend to get connected insert WAL, and get committed but after that, it will be throttled. However, if the number of max connections will be very high then even after we detected a lag there a significant amount WAL could be generated, even if we keep long-running transactions aside. But I think still it will serve the purpose of what Satya is trying to achieve.

I am afraid there are problems with making the RPO check post releasing the locks. By this time the transaction is committed and visible to the other backends (ProcArrayEndTransaction is already called) though the intention is to block committing transactions that violate the defined RPO. Even though we block existing connections starting a new transaction, it is possible to do writes by opening a new connection / canceling the query. I am not too much worried about the lock contention as the system is already hosed because of the policy. This behavior is very similar to what happens when the Sync standby is not responding. Thoughts?
Yeah, that's true: even if we block transactions from committing, it is still possible for a new connection to come in and generate more WAL. But I agree with the other part, that if you throttle after committing then the user can cancel the queries and generate more WAL from those sessions as well. That is an extreme case, though, where application developers deliberately want to bypass the throttling and generate more WAL.
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
Bharath Rupireddy
Date:
On Thu, Dec 30, 2021 at 1:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Dec 30, 2021 at 12:36 PM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
>>>
>>> Yeah, I think that would make sense, even though we will be allowing a new backend to get connected insert WAL, and get committed but after that, it will be throttled. However, if the number of max connections will be very high then even after we detected a lag there a significant amount WAL could be generated, even if we keep long-running transactions aside. But I think still it will serve the purpose of what Satya is trying to achieve.
>>
>> I am afraid there are problems with making the RPO check post releasing the locks. By this time the transaction is committed and visible to the other backends (ProcArrayEndTransaction is already called) though the intention is to block committing transactions that violate the defined RPO. Even though we block existing connections starting a new transaction, it is possible to do writes by opening a new connection / canceling the query. I am not too much worried about the lock contention as the system is already hosed because of the policy. This behavior is very similar to what happens when the Sync standby is not responding. Thoughts?
>
> Yeah, that's true, but even if we are blocking the transactions from committing then also it is possible that a new connection can come and generate more WAL, yeah but I agree with the other part that if you throttle after committing then the user can cancel the queries and generate more WAL from those sessions as well. But that is an extreme case where application developers want to bypass the throttling and want to generate more WALs.

How about having the new hook at the start of the new txn? If we do this, when the limit for the throttling is exceeded, the current txn (even if it is a long running one) continues to do the WAL insertions, the next txns would get blocked. Thoughts?

Regards,
Bharath Rupireddy.
On Thu, Dec 30, 2021 at 1:41 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Yeah, that's true, but even if we are blocking the transactions from committing then also it is possible that a new connection can come and generate more WAL, yeah but I agree with the other part that if you throttle after committing then the user can cancel the queries and generate more WAL from those sessions as well. But that is an extreme case where application developers want to bypass the throttling and want to generate more WALs.
How about having the new hook at the start of the new txn? If we do
this, when the limit for the throttling is exceeded, the current txn
(even if it is a long running one) continues to do the WAL insertions,
the next txns would get blocked. Thoughts?
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
SATYANARAYANA NARLAPURAM
Date:
On Thu, Dec 30, 2021 at 12:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Dec 30, 2021 at 1:41 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Yeah, that's true, but even if we are blocking the transactions from committing then also it is possible that a new connection can come and generate more WAL, yeah but I agree with the other part that if you throttle after committing then the user can cancel the queries and generate more WAL from those sessions as well. But that is an extreme case where application developers want to bypass the throttling and want to generate more WALs.
How about having the new hook at the start of the new txn? If we do
this, when the limit for the throttling is exceeded, the current txn
(even if it is a long running one) continues to do the WAL insertions,
the next txns would get blocked. Thoughts?

Do you mean while StartTransactionCommand or while assigning a new transaction id? If it is at StartTransactionCommand then we would be blocking the sessions which are only performing read queries, right?
Definitely not at StartTransactionCommand, but possibly while assigning the transaction id in AssignTransactionId. Blocking readers is never the intent.
If we are doing it at the transaction id assignment level then we might be holding some of the locks, so this might not be any better than throttling inside the commit.
If we define RPO as "no transaction can commit when the wal_distance is more than the configured MB", we have to throttle the writes before committing the transaction, and new WAL generation by new connections or active sessions doesn't matter as the transactions can't be committed and visible to the user. If the RPO is defined as "no new write transactions allowed when wal_distance > configured MB", then we can block assigning new transaction IDs until the RPO policy is met. IMHO, following the sync replication semantics is easier and more explainable as it is already familiar to the customers.
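As a standalone illustration of the two RPO definitions above, the lag computation over the replicas and the "no new write transactions" check might look like the sketch below. Nothing here is PostgreSQL code: the LSN type, function names, and thresholds are all hypothetical stand-ins for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: treat LSNs as plain 64-bit byte positions. */
typedef uint64_t XLogRecPtr;

/* Lag in bytes between the primary insert position and the slowest replica. */
static uint64_t
max_replica_lag(XLogRecPtr primary_lsn, const XLogRecPtr *replica_lsn, int n)
{
    uint64_t max_lag = 0;

    for (int i = 0; i < n; i++)
    {
        uint64_t lag = primary_lsn - replica_lsn[i];

        if (lag > max_lag)
            max_lag = lag;
    }
    return max_lag;
}

/*
 * RPO as "no new write transactions": a new xid would only be handed out
 * while the worst replica lag is within the configured budget.
 */
static int
xid_assignment_allowed(uint64_t lag_bytes, uint64_t replica_lag_in_bytes)
{
    return lag_bytes <= replica_lag_in_bytes;
}
```

The first definition ("no commit over budget") would use the same lag value but gate the commit path instead of xid assignment.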
Hi,

On 2021-12-29 23:06:31 -0800, SATYANARAYANA NARLAPURAM wrote:
> I am afraid there are problems with making the RPO check post releasing the
> locks. By this time the transaction is committed and visible to the other
> backends (ProcArrayEndTransaction is already called) though the intention
> is to block committing transactions that violate the defined RPO.

Shrug. Anything transaction based has way bigger holes than this.

> Even though we block existing connections starting a new transaction, it is
> possible to do writes by opening a new connection / canceling the query.

If your threat model is users explicitly trying to circumvent this they can cause problems much more easily. Trigger a bunch of vacuums, big COPYs etc.

> I am not too much worried about the lock contention as the system is already
> hosed because of the policy. This behavior is very similar to what happens
> when the Sync standby is not responding. Thoughts?

I don't see why we'd bury ourselves deeper in problems just because we already have a problem.

There's reasons why we want to do the delay for syncrep be before xact completion - but I don't see those applying to WAL throttling to a significant degree, particularly not when it's on a transaction level.

Greetings,

Andres Freund
Hi,

On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
> That's pretty much the same - XLogInsert() can trigger an
> XLogWrite()/Flush().
>
> I think it's a complete no-go to add throttling to these places. It's quite
> possible that it'd cause new deadlocks, and it's almost guaranteed to have
> unintended consequences (e.g. replication falling back further because
> XLogFlush() is being throttled).

I thought of another way to implement this feature. What if we checked the current distance somewhere within XLogInsert(), but only set InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we check if XLogDelayPending is true and sleep the appropriate time.

That way the sleep doesn't happen with important locks held / within a critical section, but we still delay close to where we went over the maximum lag. And the overhead should be fairly minimal.

I'm doubtful that implementing the waits on a transactional level provides a meaningful enough amount of control - there's just too much WAL that can be generated within a transaction.

Greetings,

Andres Freund
On Wed, Jan 5, 2022 at 11:16 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
> > That's pretty much the same - XLogInsert() can trigger an
> > XLogWrite()/Flush().
> >
> > I think it's a complete no-go to add throttling to these places. It's quite
> > possible that it'd cause new deadlocks, and it's almost guaranteed to have
> > unintended consequences (e.g. replication falling back further because
> > XLogFlush() is being throttled).
>
> I thought of another way to implement this feature. What if we checked the
> current distance somewhere within XLogInsert(), but only set
> InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
> check if XLogDelayPending is true and sleep the appropriate time.
>
> That way the sleep doesn't happen with important locks held / within a
> critical section, but we still delay close to where we went over the maximum
> lag. And the overhead should be fairly minimal.

+1, this sounds like a really good idea to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
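The deferred-delay idea discussed in this exchange (check the lag cheaply in the insert path, but do the actual sleep at the next safe interrupt-processing point) can be modeled in a self-contained way. The following is only a toy simulation: the two flag names mirror the proposal, but the functions, counter, and threshold are invented for illustration and none of this is actual PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flags mirroring the proposal; in PostgreSQL these would be backend-local. */
static volatile bool InterruptPending = false;
static volatile bool XLogDelayPending = false;

static int sleeps_taken = 0;               /* stands in for time spent sleeping */
static const uint64_t max_lag = 1024;      /* illustrative threshold, in bytes */

/* Simulated XLogInsert() fast path: only sets flags, never sleeps. */
static void
note_lag_after_insert(uint64_t lag)
{
    if (lag > max_lag)
    {
        InterruptPending = true;
        XLogDelayPending = true;
    }
}

/* Simulated ProcessInterrupts(): called at a safe point, does the delay. */
static void
process_interrupts(void)
{
    if (!InterruptPending)
        return;
    InterruptPending = false;

    if (XLogDelayPending)
    {
        XLogDelayPending = false;
        sleeps_taken++;                    /* stand-in for pg_usleep(delay) */
    }
}
```

The point of the split is visible in the control flow: the insert path only raises flags, so no sleep can happen while buffer locks are held or inside a critical section; the wait happens later, wherever interrupts are next processed.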
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
SATYANARAYANA NARLAPURAM
Date:
On Wed, Jan 5, 2022 at 9:46 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
> That's pretty much the same - XLogInsert() can trigger an
> XLogWrite()/Flush().
>
> I think it's a complete no-go to add throttling to these places. It's quite
> possible that it'd cause new deadlocks, and it's almost guaranteed to have
> unintended consequences (e.g. replication falling back further because
> XLogFlush() is being throttled).
I thought of another way to implement this feature. What if we checked the
current distance somewhere within XLogInsert(), but only set
InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
check if XLogDelayPending is true and sleep the appropriate time.
That way the sleep doesn't happen with important locks held / within a
critical section, but we still delay close to where we went over the maximum
lag. And the overhead should be fairly minimal.
+1 to the idea; this way we can throttle large and small transactions fairly, in the same way. I will work on this model and share the patch. Please note that the lock contention still exists in this case.
I'm doubtful that implementing the waits on a transactional level provides a
meaningful enough amount of control - there's just too much WAL that can be
generated within a transaction.
Greetings,
Andres Freund
On Thu, Jan 6, 2022 at 11:27 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
> On Wed, Jan 5, 2022 at 9:46 AM Andres Freund <andres@anarazel.de> wrote:
>>
>> Hi,
>>
>> On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
>> > That's pretty much the same - XLogInsert() can trigger an
>> > XLogWrite()/Flush().
>> >
>> > I think it's a complete no-go to add throttling to these places. It's quite
>> > possible that it'd cause new deadlocks, and it's almost guaranteed to have
>> > unintended consequences (e.g. replication falling back further because
>> > XLogFlush() is being throttled).
>>
>> I thought of another way to implement this feature. What if we checked the
>> current distance somewhere within XLogInsert(), but only set
>> InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
>> check if XLogDelayPending is true and sleep the appropriate time.
>>
>> That way the sleep doesn't happen with important locks held / within a
>> critical section, but we still delay close to where we went over the maximum
>> lag. And the overhead should be fairly minimal.
>
> +1 to the idea, this way we can fairly throttle large and smaller transactions the same way. I will work on this model and share the patch. Please note that the lock contention still exists in this case.

Generally while checking for the interrupt we should not be holding any lwlock, which means we don't have the risk of holding any buffer locks, so any other reader can continue to read from those buffers. We will only be holding some heavyweight locks like relation/tuple locks, but those will not impact anyone except the writers trying to update the same tuple or the DDL that wants to modify the table definition, so I don't think we have any issue here because anyway we don't want other writers to continue.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
SATYANARAYANA NARLAPURAM
Date:
On Wed, Jan 5, 2022 at 10:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Jan 6, 2022 at 11:27 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
> On Wed, Jan 5, 2022 at 9:46 AM Andres Freund <andres@anarazel.de> wrote:
>>
>> Hi,
>>
>> On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
>> > That's pretty much the same - XLogInsert() can trigger an
>> > XLogWrite()/Flush().
>> >
>> > I think it's a complete no-go to add throttling to these places. It's quite
>> > possible that it'd cause new deadlocks, and it's almost guaranteed to have
>> > unintended consequences (e.g. replication falling back further because
>> > XLogFlush() is being throttled).
>>
>> I thought of another way to implement this feature. What if we checked the
>> current distance somewhere within XLogInsert(), but only set
>> InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts() we
>> check if XLogDelayPending is true and sleep the appropriate time.
>>
>> That way the sleep doesn't happen with important locks held / within a
>> critical section, but we still delay close to where we went over the maximum
>> lag. And the overhead should be fairly minimal.
>
>
> +1 to the idea, this way we can fairly throttle large and smaller transactions the same way. I will work on this model and share the patch. Please note that the lock contention still exists in this case.
Generally while checking for the interrupt we should not be holding
any lwlock that means we don't have the risk of holding any buffer
locks, so any other reader can continue to read from those buffers.
We will only be holding some heavyweight locks like relation/tuple
lock but that will not impact anyone except the writers trying to
update the same tuple or the DDL who want to modify the table
definition so I don't think we have any issue here because anyway we
don't want other writers to continue.
Yes, it should be okay. I was just making it explicit, following Andres' earlier comment about holding heavyweight locks while waiting before the commit.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
"Bossart, Nathan"
Date:
I noticed this thread and thought I'd share my experiences building something similar for Multi-AZ DB clusters [0]. It's not a strict RPO mechanism, but it does throttle backends in an effort to keep the replay lag below a configured maximum. I can share the code if there is interest.

I wrote it as a new extension, and except for one piece that I'll go into later, I was able to avoid changes to core PostgreSQL code. The extension manages a background worker that periodically assesses the state of the designated standbys and updates an atomic in shared memory that indicates how long to delay. A transaction callback checks this value and sleeps as necessary. Delay can be injected for write-enabled transactions on the primary, read-only transactions on the standbys, or both. The extension is heavily configurable so that it can meet the needs of a variety of workloads.

One interesting challenge I encountered was accurately determining the amount of replay lag. The problem was twofold. First, if there is no activity on the primary, there will be nothing to replay on the standbys, so the replay lag will appear to grow unbounded. To work around this, the extension's background worker periodically creates an empty COMMIT record. Second, if a standby reconnects after a long time, the replay lag won't be accurate for some time. Instead, the replay lag will slowly increase until it reaches the correct value. Since the delay calculation looks at the trend of the replay lag, this apparent unbounded growth causes it to inject far more delay than is necessary. My guess is that this is related to 9ea3c64, and maybe it is worth rethinking that logic. For now, the extension just periodically reports the value of GetLatestXTime() from the standbys to the primary to get an accurate reading. This is done via a new replication callback mechanism (which requires core PostgreSQL changes). I can share this patch along with the extension, as I bet there are other applications for it.
I should also note that the extension only considers "active" standbys and primaries. That is, ones with an active WAL sender or WAL receiver. This avoids the need to guess what should be done during a network partition, but it also means that we must gracefully handle standbys reconnecting with massive amounts of lag. The extension is designed to slowly ramp up the amount of injected delay until the standby's apply lag is trending down at a sufficient rate.

I see that an approach was suggested upthread for throttling based on WAL distance instead of per-transaction. While the transaction approach works decently well for certain workloads (e.g., many small transactions like those from pgbench), it might require further tuning for very large transactions or workloads with a variety of transaction sizes. For that reason, I would definitely support building a way to throttle based on WAL generation. It might be a good idea to avoid throttling critical activity such as anti-wraparound vacuuming, too.

Nathan

[0] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/multi-az-db-clusters-concepts.html
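The architecture described here (a background worker publishing a delay through a shared atomic, and a transaction callback consuming it) can be sketched roughly as below. This is a guess at the shape, not the actual extension code: the function names, the bytes-to-microseconds conversion, and the use of a C11 atomic in place of PostgreSQL shared memory are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the atomic in shared memory: delay to inject, in usec. */
static _Atomic uint32_t injected_delay_us = 0;

/*
 * Background-worker side: periodically derive a delay from the observed
 * replay lag. The linear bytes-over-budget -> usec mapping is invented;
 * the real extension is described as looking at the lag's trend.
 */
static void
update_delay(uint64_t replay_lag_bytes, uint64_t target_lag_bytes)
{
    uint32_t delay = 0;

    if (replay_lag_bytes > target_lag_bytes)
        delay = (uint32_t) ((replay_lag_bytes - target_lag_bytes) / 1024);

    atomic_store(&injected_delay_us, delay);
}

/*
 * Backend side: a transaction callback would read this value and sleep
 * for that long (e.g. via pg_usleep) before proceeding.
 */
static uint32_t
transaction_delay(void)
{
    return atomic_load(&injected_delay_us);
}
```

The appeal of this split is that the backends' per-transaction cost is a single atomic load; all the expensive standby-state assessment happens in the worker.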
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
Konstantin Knizhnik
Date:
On 11.01.2022 03:06, Bossart, Nathan wrote:
> I noticed this thread and thought I'd share my experiences building something similar for Multi-AZ DB clusters [0]. It's not a strict RPO mechanism, but it does throttle backends in an effort to keep the replay lag below a configured maximum. I can share the code if there is interest.
>
> I wrote it as a new extension, and except for one piece that I'll go into later, I was able to avoid changes to core PostgreSQL code. The extension manages a background worker that periodically assesses the state of the designated standbys and updates an atomic in shared memory that indicates how long to delay. A transaction callback checks this value and sleeps as necessary. Delay can be injected for write-enabled transactions on the primary, read-only transactions on the standbys, or both. The extension is heavily configurable so that it can meet the needs of a variety of workloads.
>
> One interesting challenge I encountered was accurately determining the amount of replay lag. The problem was twofold. First, if there is no activity on the primary, there will be nothing to replay on the standbys, so the replay lag will appear to grow unbounded. To work around this, the extension's background worker periodically creates an empty COMMIT record. Second, if a standby reconnects after a long time, the replay lag won't be accurate for some time. Instead, the replay lag will slowly increase until it reaches the correct value. Since the delay calculation looks at the trend of the replay lag, this apparent unbounded growth causes it to inject far more delay than is necessary. My guess is that this is related to 9ea3c64, and maybe it is worth rethinking that logic. For now, the extension just periodically reports the value of GetLatestXTime() from the standbys to the primary to get an accurate reading. This is done via a new replication callback mechanism (which requires core PostgreSQL changes). I can share this patch along with the extension, as I bet there are other applications for it.
>
> I should also note that the extension only considers "active" standbys and primaries. That is, ones with an active WAL sender or WAL receiver. This avoids the need to guess what should be done during a network partition, but it also means that we must gracefully handle standbys reconnecting with massive amounts of lag. The extension is designed to slowly ramp up the amount of injected delay until the standby's apply lag is trending down at a sufficient rate.
>
> I see that an approach was suggested upthread for throttling based on WAL distance instead of per-transaction. While the transaction approach works decently well for certain workloads (e.g., many small transactions like those from pgbench), it might require further tuning for very large transactions or workloads with a variety of transaction sizes. For that reason, I would definitely support building a way to throttle based on WAL generation. It might be a good idea to avoid throttling critical activity such as anti-wraparound vacuuming, too.
>
> Nathan
>
> [0] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/multi-az-db-clusters-concepts.html

We have faced a similar problem in Zenith (open-source Aurora) and had to implement a back pressure mechanism to prevent overflow of WAL at stateless compute nodes and too-long delays of page reconstruction. Our implementation is the following:

1. Three GUCs are added: max_replication_write/flush/apply_lag
2. Replication lags are checked in XLogInsert and if one of the 3 thresholds is reached then InterruptPending is set.
3. In ProcessInterrupts we block backend execution until the lag is within the specified boundary:

    #define BACK_PRESSURE_DELAY 10000L // 0.01 sec

    while (true)
    {
        ProcessInterrupts_pg();

        // Suspend writers until replicas catch up
        lag = backpressure_lag();
        if (lag <= 0)
            break;

        set_ps_display("backpressure throttling");
        elog(DEBUG2, "backpressure throttling: lag %lu", lag);
        pg_usleep(BACK_PRESSURE_DELAY);
    }

What is wrong here is that a backend can be blocked for a long time (causing failure of the client application due to timeout expiration) and hold acquired locks while sleeping.

We are thinking about a smarter way of choosing the throttling delay (for example exponential increase of the throttling sleep interval until some maximal value is reached). But it is really hard to find some universal scheme which will be good for all use cases (for example the case of a short-living session whose client connects to the server to execute just one query).

Concerning throttling at the end of the transaction, which eliminates the problem with holding locks and does not require changes in the postgres core: unfortunately it doesn't address the problem with large transactions (for example bulk load of data using COPY). In this case just one transaction can cause an arbitrarily large lag.

I am not sure how critical the problem with holding locks during throttling is: yes, it may block other database activity, including vacuum and execution of read-only queries. But it should not block walsender and so cause deadlock. And in most cases read-only transactions do not conflict with write transactions, so suspending write transactions should not block readers.

Another problem with throttling is large WAL records (for example a custom logical replication WAL record can be arbitrarily large). If such a record is larger than the replication lag limit, then it can cause deadlock.
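The "exponential increase of the throttling sleep interval until some maximal value is reached" idea mentioned above could be computed as in the sketch below. The function name and the cap are illustrative only; the 10 ms floor matches the BACK_PRESSURE_DELAY constant from the quoted snippet.

```c
#include <assert.h>
#include <stdint.h>

#define BACKPRESSURE_MIN_DELAY_US 10000UL    /* 0.01 sec, as in the snippet */
#define BACKPRESSURE_MAX_DELAY_US 1000000UL  /* hypothetical 1 sec cap */

/*
 * Return the next sleep interval: start at the minimum, double on each
 * consecutive throttled check, and saturate at the maximum.
 */
static uint64_t
next_backpressure_delay(uint64_t prev_delay_us)
{
    uint64_t next;

    if (prev_delay_us == 0)
        return BACKPRESSURE_MIN_DELAY_US;

    next = prev_delay_us * 2;
    return next > BACKPRESSURE_MAX_DELAY_US ? BACKPRESSURE_MAX_DELAY_US : next;
}
```

A caller would reset the delay to zero once the lag drops back under the threshold, so short transient lags still only pay the minimum sleep.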
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
From
Bharath Rupireddy
Date:
On Tue, Jan 11, 2022 at 2:11 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
>
> We have faced with the similar problem in Zenith (open source Aurora)
> and have to implement back pressure mechanism to prevent overflow of WAL
> at stateless compute nodes and too long delays of page reconstruction.
> Our implementation is the following:
> 1. Three GUCs are added: max_replication_write/flush/apply_lag
> 2. Replication lags are checked in XLogInsert and if one of 3 thresholds
> is reached then InterruptPending is set.
> 3. In ProcessInterrupts we block backend execution until lag is within
> specified boundary:
>
> #define BACK_PRESSURE_DELAY 10000L // 0.01 sec
> while(true)
> {
>     ProcessInterrupts_pg();
>
>     // Suspend writers until replicas catch up
>     lag = backpressure_lag();
>     if (lag <= 0)
>         break;
>
>     set_ps_display("backpressure throttling");
>     elog(DEBUG2, "backpressure throttling: lag %lu", lag);
>     pg_usleep(BACK_PRESSURE_DELAY);
> }
>
> What is wrong here is that backend can be blocked for a long time
> (causing failure of client application due to timeout expiration) and
> hold acquired locks while sleeping.

Do we ever call CHECK_FOR_INTERRUPTS() while holding "important" locks? I haven't seen any asserts or anything of that sort in ProcessInterrupts() though; it looks like it's the caller's responsibility to not process interrupts while holding heavyweight locks. There are some points on this upthread [1].

I don't think we have a problem with the various postgres timeouts (statement_timeout, lock_timeout, idle_in_transaction_session_timeout, idle_session_timeout, client_connection_check_interval) while we wait for the replication lag to get better in ProcessInterrupts(). I think SIGALRM can be raised while we wait for the replication lag to get better, but it can't be handled. Why can't we just disable these timeouts before going to wait and reset/enable them right after the replication lag gets better?

And the clients can always have their own no-reply-kill-transaction-sort-of timeout; if yes, let them fail and deal with it. I don't think we can do much about this.

> We are thinking about smarter way of choosing throttling delay (for
> example exponential increase of throttling sleep interval until some
> maximal value is reached).
> But it is really hard to find some universal schema which will be good
> for all use cases (for example in case of short living session, which
> clients are connected to the server to execute just one query).

I think there has to be an upper limit to the wait, perhaps a 'preconfigured amount of time'. I think others upthread aren't happy with failing transactions because of the replication lag. But my point is: how much time would we let the backends wait or throttle WAL writes? It mustn't be forever (say, if a broken connection to the async standby is found).

[1] https://www.postgresql.org/message-id/20220105174643.lozdd3radxv4tlmx%40alap3.anarazel.de

Regards,
Bharath Rupireddy.