Thread: walsender performance regression due to logical decoding on standby changes


walsender performance regression due to logical decoding on standby changes

From
Andres Freund
Date:
Hi,

Unfortunately I have found the following commit to have caused a performance
regression:

commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
Author: Andres Freund <andres@anarazel.de>
Date:   2023-04-08 00:24:24 -0700

    For cascading replication, wake physical and logical walsenders separately

    Physical walsenders can't send data until it's been flushed; logical
    walsenders can't decode and send data until it's been applied. On the
    standby, the WAL is flushed first, which will only wake up physical
    walsenders; and then applied, which will only wake up logical
    walsenders.

    Previously, all walsenders were awakened when the WAL was flushed. That
    was fine for logical walsenders on the primary; but on the standby the
    flushed WAL would have been not applied yet, so logical walsenders were
    awakened too early.

    Per idea from Jeff Davis and Amit Kapila.

    Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
    Reviewed-By: Jeff Davis <pgsql@j-davis.com>
    Reviewed-By: Robert Haas <robertmhaas@gmail.com>
    Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
    Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
    Discussion: https://postgr.es/m/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com

The problem is that, on a standby, after the change - as needed for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).

WalSndWakeup() iterates over all walsender slots, regardless of whether they
are in use. For each of the walsender slots it acquires a spinlock.
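
For reference, the loop in question has roughly this shape (paraphrased, not
quoted verbatim from the tree); the spinlock is taken just to read the slot's
latch pointer and replication kind:

/* Paraphrased shape of WalSndWakeup() as of e101dfac3a, not the exact source */
void
WalSndWakeup(bool physical, bool logical)
{
	for (int i = 0; i < max_wal_senders; i++)
	{
		Latch	   *latch;
		ReplicationKind kind;
		WalSnd	   *walsnd = &WalSndCtl->walsnds[i];

		/* spinlock is needed to read the latch pointer and kind safely */
		SpinLockAcquire(&walsnd->mutex);
		latch = walsnd->latch;
		kind = walsnd->kind;
		SpinLockRelease(&walsnd->mutex);

		if (latch == NULL)
			continue;			/* slot not in use */

		if ((physical && kind == REPLICATION_KIND_PHYSICAL) ||
			(logical && kind == REPLICATION_KIND_LOGICAL))
			SetLatch(latch);
	}
}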

When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.

The only saving grace is that this is not an issue on the primary.


I unfortunately spent less time on this commit of the
logical-decoding-on-standby series than on the others. There were so many
other senior contributors discussing it that I "relaxed" a bit too much.


I don't think the approach of not having any sort of "registry" of whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsender slots is just too expensive -
WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
especially when just checking if the walsender is connected).

My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.
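
In code terms, the sketch would look something like this ("wakeup_cv" is an
assumed field in WalSndCtlData; an actual patch materializes later in this
thread):

/* Sketch only; "wakeup_cv" is an assumed field in WalSndCtlData. */
static void
WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
{
	WaitEvent	event;

	ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);

	/* Register on the CV's wait list; a broadcast will set our latch. */
	ConditionVariablePrepareToSleep(&WalSndCtl->wakeup_cv);

	if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
		(event.events & WL_POSTMASTER_DEATH))
		proc_exit(1);

	/* Deregister; ConditionVariableSleep() is intentionally never called. */
	ConditionVariableCancelSleep();
}

/* Waker side (startup process etc.), instead of scanning all walsender slots: */
ConditionVariableBroadcast(&WalSndCtl->wakeup_cv);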

Greetings,

Andres Freund



Re: walsender performance regression due to logical decoding on standby changes

From
Jeff Davis
Date:
On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
> I don't think the approach of not having any sort of "registry" of
> whether
> anybody is waiting for the replay position to be updated is
> feasible. Iterating over all walsenders slots is just too expensive -

Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?
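
A rough sketch of the idea, with invented field names (the counters would be
pg_atomic_uint32 fields in the walsender shared state):

void
WalSndWakeup(bool physical, bool logical)
{
	/* Early exit when no walsender of the requested kind is waiting. */
	if ((!physical || pg_atomic_read_u32(&WalSndCtl->physical_waiters) == 0) &&
		(!logical || pg_atomic_read_u32(&WalSndCtl->logical_waiters) == 0))
		return;

	/* ... otherwise fall through to the existing per-slot loop ... */
}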

Regards,
    Jeff Davis




Re: walsender performance regression due to logical decoding on standby changes

From
Andres Freund
Date:
Hi,

On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:
> On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
> > I don't think the approach of not having any sort of "registry" of
> > whether
> > anybody is waiting for the replay position to be updated is
> > feasible. Iterating over all walsenders slots is just too expensive -
> 
> Would it work to use a shared counter for the waiters (or, two
> counters, one for physical and one for logical), and just early exit if
> the count is zero?

That doesn't really fix the problem - once you have a single walsender
connected, performance is bad again.

Greetings,

Andres Freund



Re: walsender performance regression due to logical decoding on standby changes

From
Bharath Rupireddy
Date:
On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:
>
> Unfortunately I have found the following commit to have caused a performance
> regression:
>
> commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
>
> The problem is that, on a standby, after the change - as needed to for the
> approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
> every record, instead of only happening when the timeline is changed (or WAL
> is flushed or ...).
>
> WalSndWakeup() iterates over all walsender slots, regardless of whether in
> use. For each of the walsender slots it acquires a spinlock.
>
> When replaying a lot of small-ish WAL records I found the startup process to
> spend the majority of the time in WalSndWakeup(). I've not measured it very
> precisely yet, but the overhead is significant (~35% slowdown), even with the
> default max_wal_senders. If that's increased substantially, it obviously gets
> worse.

I played with this on a simple primary -> standby1 -> standby2 setup. I ran
a pgbench script [1] on the primary and counted the number of times
WalSndWakeup() gets called from ApplyWalRecord() and the number of
times the spinlock is acquired/released in WalSndWakeup(). The spinlock is
acquired/released a whopping 21 million times on standby 1 and standby 2
for just a < 5 min pgbench run on the primary:

standby 1:
2023-05-10 05:32:43.249 UTC [1595600] LOG:  FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595600] LOG:  FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530

standby 2:
2023-05-10 05:32:43.249 UTC [1595625] LOG:  FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595625] LOG:  FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530

In this case there is no timeline switch, no logical decoding on the
standby, or anything of that sort, but WalSndWakeup() still gets called
because the standby can't tell whether a slot is for logical or physical
replication unless the spinlock is acquired. Before e101dfac3a,
WalSndWakeup() was called only when there was a timeline switch.

> The only saving grace is that this is not an issue on the primary.

Yeah.

> I don't think the approach of not having any sort of "registry" of whether
> anybody is waiting for the replay position to be updated is
> feasible. Iterating over all walsenders slots is just too expensive -
> WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
> especially when just checking if the the walsender is connected).

Right.

> My current guess is that mis-using a condition variable is the best bet. I
> think it should work to use ConditionVariablePrepareToSleep() before a
> WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
> ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
> would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription tests
don't complain with the patch.

[1]
./pg_ctl -D data -l logfile stop
./pg_ctl -D sbdata1 -l logfile1 stop
./pg_ctl -D sbdata2 -l logfile2 stop

cd /home/ubuntu/postgres
rm -rf inst
make distclean
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j
8 install > install.log 2>&1 &

primary -> standby1 -> standby2

free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m

cd inst/bin
./initdb -D data
mkdir archived_wal

cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/postgres/inst/bin/archived_wal/%f'
EOF

./pg_ctl -D data -l logfile start
./pg_basebackup -D sbdata1
./psql -c "select pg_create_physical_replication_slot('sb1_slot',
true, false)" postgres

cat << EOF >> sbdata1/postgresql.conf
port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu
application_name=sb1'
primary_slot_name='sb1_slot'
restore_command='cp /home/ubuntu/postgres/inst/bin/archived_wal/%f %p'
EOF

touch sbdata1/standby.signal

./pg_ctl -D sbdata1 -l logfile1 start
./pg_basebackup -D sbdata2 -p 5433
./psql -p 5433 -c "select
pg_create_physical_replication_slot('sb2_slot', true, false)" postgres

cat << EOF >> sbdata2/postgresql.conf
port=5434
primary_conninfo='host=localhost port=5433 dbname=postgres user=ubuntu
application_name=sb2'
primary_slot_name='sb2_slot'
restore_command='cp /home/ubuntu/postgres/inst/bin/archived_wal/%f %p'
EOF

touch sbdata2/standby.signal

./pg_ctl -D sbdata2 -l logfile2 start
./psql -c "select pid, backend_type from pg_stat_activity where
backend_type in ('walreceiver', 'walsender');" postgres
./psql -p 5433 -c "select pid, backend_type from pg_stat_activity
where backend_type in ('walreceiver', 'walsender');" postgres
./psql -p 5434 -c "select pid, backend_type from pg_stat_activity
where backend_type in ('walreceiver', 'walsender');" postgres

ulimit -S -n 5000
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: walsender performance regression due to logical decoding on standby changes

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/9/23 11:00 PM, Andres Freund wrote:
> Hi,
> 
> On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:
>> On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
>>> I don't think the approach of not having any sort of "registry" of
>>> whether
>>> anybody is waiting for the replay position to be updated is
>>> feasible. Iterating over all walsenders slots is just too expensive -
>>
>> Would it work to use a shared counter for the waiters (or, two
>> counters, one for physical and one for logical), and just early exit if
>> the count is zero?
> 
> That doesn't really fix the problem - once you have a single walsender
> connected, performance is bad again.
> 

Just to clarify, do you mean that if there is only one remaining active walsender that, say,
has been located at slot n, then we’d still have to loop from 0 to n in WalSndWakeup()?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: walsender performance regression due to logical decoding on standby changes

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/10/23 8:36 AM, Bharath Rupireddy wrote:
> On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:
>>
>> Unfortunately I have found the following commit to have caused a performance
>> regression:
>>
>> commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
>>
>> The problem is that, on a standby, after the change - as needed to for the
>> approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
>> every record, instead of only happening when the timeline is changed (or WAL
>> is flushed or ...).
>>
>> WalSndWakeup() iterates over all walsender slots, regardless of whether in
>> use. For each of the walsender slots it acquires a spinlock.
>>
>> When replaying a lot of small-ish WAL records I found the startup process to
>> spend the majority of the time in WalSndWakeup(). I've not measured it very
>> precisely yet, but the overhead is significant (~35% slowdown), even with the
>> default max_wal_senders. If that's increased substantially, it obviously gets
>> worse.
> 

Thanks Andres for the call out! I do agree that this is a concern.

>> The only saving grace is that this is not an issue on the primary.
> 
> Yeah.

+1

> 
>> My current guess is that mis-using a condition variable is the best bet. I
>> think it should work to use ConditionVariablePrepareToSleep() before a
>> WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
>> ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
>> would still cause the necessary wakeup.

Yeah, I think that "mis-using" a condition variable is a valid option. Unless I'm missing
something, I don't think there is anything wrong with using a CV that way (aka not using
ConditionVariableTimedSleep() or ConditionVariableSleep() in this particular case).

> 
> How about something like the attached? Recovery and subscription tests
> don't complain with the patch.

Thanks Bharath for looking at it!

I launched a full Cirrus CI test with it but it failed on one environment (I did not look into the details,
just sharing this here): https://cirrus-ci.com/task/6570140767092736

Also I have a few comments:

@@ -1958,7 +1959,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
          * ------
          */
         if (AllowCascadeReplication())
-               WalSndWakeup(switchedTLI, true);
+               ConditionVariableBroadcast(&WalSndCtl->cv);

I think the comment above this change needs to be updated.


@@ -3368,9 +3370,13 @@ WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
         WaitEvent       event;

         ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
+
+       ConditionVariablePrepareToSleep(&WalSndCtl->cv);
         if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
                 (event.events & WL_POSTMASTER_DEATH))
                 proc_exit(1);
+
+       ConditionVariableCancelSleep();

Maybe worth updating the comment above WalSndWait() too? (to explain why the CV handling is part of it).


@@ -108,6 +109,8 @@ typedef struct
          */
         bool            sync_standbys_defined;

+       ConditionVariable cv;

Worth giving it a more meaningful name? (to give a clue as to when it is used)


I think we also need to update the comment above WalSndWakeup():

"
  * For cascading replication we need to wake up physical walsenders separately
  * from logical walsenders (see the comment before calling WalSndWakeup() in
  * ApplyWalRecord() for more details).
"

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



RE: walsender performance regression due to logical decoding on standby changes

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> 
> On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de>
> wrote:
> >
> > Unfortunately I have found the following commit to have caused a
> performance
> > regression:
> >
> > commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
> >
> > The problem is that, on a standby, after the change - as needed to for the
> > approach to work - the call to WalSndWakeup() in ApplyWalRecord()
> happens for
> > every record, instead of only happening when the timeline is changed (or
> WAL
> > is flushed or ...).
> >
> > WalSndWakeup() iterates over all walsender slots, regardless of whether in
> > use. For each of the walsender slots it acquires a spinlock.
> >
> > When replaying a lot of small-ish WAL records I found the startup process to
> > spend the majority of the time in WalSndWakeup(). I've not measured it very
> > precisely yet, but the overhead is significant (~35% slowdown), even with the
> > default max_wal_senders. If that's increased substantially, it obviously gets
> > worse.
> 
> I played it with a simple primary -> standby1 -> standby2 setup. I ran
> a pgbench script [1] on primary and counted the number of times
> WalSndWakeup() gets called from ApplyWalRecord() and the number of
> times spinlock is acquired/released in WalSndWakeup(). It's a whopping
> 21 million times spinlock is acquired/released on the standby 1 and
> standby 2 for just a < 5min of pgbench run on the primary:
> 
> standby 1:
> 2023-05-10 05:32:43.249 UTC [1595600] LOG:  FOO WalSndWakeup() in
> ApplyWalRecord() was called 2176352 times
> 2023-05-10 05:32:43.249 UTC [1595600] LOG:  FOO spinlock
> acquisition/release count in WalSndWakeup() is 21763530
> 
> standby 2:
> 2023-05-10 05:32:43.249 UTC [1595625] LOG:  FOO WalSndWakeup() in
> ApplyWalRecord() was called 2176352 times
> 2023-05-10 05:32:43.249 UTC [1595625] LOG:  FOO spinlock
> acquisition/release count in WalSndWakeup() is 21763530
> 
> In this case, there is no timeline switch or no logical decoding on
> the standby or such, but WalSndWakeup() gets called because the
> standby can't make out if the slot is for logical or physical
> replication unless spinlock is acquired. Before e101dfac3a,
> WalSndWakeup() was getting called only when there was a timeline
> switch.
> 
> > The only saving grace is that this is not an issue on the primary.
> 
> Yeah.
> 
> > I don't think the approach of not having any sort of "registry" of whether
> > anybody is waiting for the replay position to be updated is
> > feasible. Iterating over all walsenders slots is just too expensive -
> > WalSndWakeup() shows up even if I remove the spinlock (which we likely
> could,
> > especially when just checking if the the walsender is connected).
> 
> Right.
> 
> > My current guess is that mis-using a condition variable is the best bet. I
> > think it should work to use ConditionVariablePrepareToSleep() before a
> > WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
> > ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
> > would still cause the necessary wakeup.
> 
> How about something like the attached? Recovery and subscription tests
> don't complain with the patch.

Thanks for the patch. I noticed one place where the logic is different from before and
would like to confirm:

    if (AllowCascadeReplication())
-        WalSndWakeup(switchedTLI, true);
+        ConditionVariableBroadcast(&WalSndCtl->cv);

After the change, we wake up physical walsenders regardless of the switchedTLI flag.
Is this intentional? If so, I think it would be better to update the comments above this.

Best Regards,
Hou zj

Re: walsender performance regression due to logical decoding on standby changes

From
Amit Kapila
Date:
On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> >
> > > My current guess is that mis-using a condition variable is the best bet. I
> > > think it should work to use ConditionVariablePrepareToSleep() before a
> > > WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
> > > ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
> > > would still cause the necessary wakeup.
> >
> > How about something like the attached? Recovery and subscription tests
> > don't complain with the patch.
>
> Thanks for the patch. I noticed one place where the logic is different from before and
> just to confirm:
>
>         if (AllowCascadeReplication())
> -               WalSndWakeup(switchedTLI, true);
> +               ConditionVariableBroadcast(&WalSndCtl->cv);
>
> After the change, we wakeup physical walsender regardless of switchedTLI flag.
> Is this intentional ? if so, I think It would be better to update the comments above this.
>

This raises the question of whether we need this condition variable
logic for physical walsenders?

--
With Regards,
Amit Kapila.



Re: walsender performance regression due to logical decoding on standby changes

From
Masahiko Sawada
Date:
On Wed, May 10, 2023 at 7:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > >
> > > > My current guess is that mis-using a condition variable is the best bet. I
> > > > think it should work to use ConditionVariablePrepareToSleep() before a
> > > > WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
> > > > ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
> > > > would still cause the necessary wakeup.
> > >
> > > How about something like the attached? Recovery and subscription tests
> > > don't complain with the patch.
> >
> > Thanks for the patch. I noticed one place where the logic is different from before and
> > just to confirm:
> >
> >         if (AllowCascadeReplication())
> > -               WalSndWakeup(switchedTLI, true);
> > +               ConditionVariableBroadcast(&WalSndCtl->cv);
> >
> > After the change, we wakeup physical walsender regardless of switchedTLI flag.
> > Is this intentional ? if so, I think It would be better to update the comments above this.
> >
>
> This raises the question of whether we need this condition variable
> logic for physical walsenders?

It sounds like a good idea. We can have two condition variables for
logical and physical walsenders, and selectively wake up walsenders
sleeping on the condition variables. It should work, though it does seem
like a bit of a hack.
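
A minimal sketch of that shape, with assumed field names:

/* Sketch: one CV per walsender kind, broadcast only to the relevant one. */
void
WalSndWakeup(bool physical, bool logical)
{
	/* Physical walsenders wait for WAL to be flushed. */
	if (physical)
		ConditionVariableBroadcast(&WalSndCtl->wal_flush_cv);

	/* Logical walsenders on a standby wait for WAL to be applied. */
	if (logical)
		ConditionVariableBroadcast(&WalSndCtl->wal_replay_cv);
}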

Regards,

[1] https://www.postgresql.org/message-id/2d314c22b9e03415aa1c7d8fd1f698dae60effa7.camel%40j-davis.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: walsender performance regression due to logical decoding on standby changes

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, May 10, 2023 3:03 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
> 
> Unfortunately I have found the following commit to have caused a
> performance
> regression:
> 
> commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
> Author: Andres Freund <andres@anarazel.de>
> Date:   2023-04-08 00:24:24 -0700
> 
>     For cascading replication, wake physical and logical walsenders separately
> 
>     Physical walsenders can't send data until it's been flushed; logical
>     walsenders can't decode and send data until it's been applied. On the
>     standby, the WAL is flushed first, which will only wake up physical
>     walsenders; and then applied, which will only wake up logical
>     walsenders.
> 
>     Previously, all walsenders were awakened when the WAL was flushed. That
>     was fine for logical walsenders on the primary; but on the standby the
>     flushed WAL would have been not applied yet, so logical walsenders were
>     awakened too early.
> 
>     Per idea from Jeff Davis and Amit Kapila.
> 
>     Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
>     Reviewed-By: Jeff Davis <pgsql@j-davis.com>
>     Reviewed-By: Robert Haas <robertmhaas@gmail.com>
>     Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
>     Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
>     Discussion:
> https://postgr.es/m/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkb
> W7Hqh6vQ@mail.gmail.com
> 
> The problem is that, on a standby, after the change - as needed to for the
> approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens
> for
> every record, instead of only happening when the timeline is changed (or WAL
> is flushed or ...).
> 
> WalSndWakeup() iterates over all walsender slots, regardless of whether in
> use. For each of the walsender slots it acquires a spinlock.
> 
> When replaying a lot of small-ish WAL records I found the startup process to
> spend the majority of the time in WalSndWakeup(). I've not measured it very
> precisely yet, but the overhead is significant (~35% slowdown), even with the
> default max_wal_senders. If that's increased substantially, it obviously gets
> worse.

I did some simple tests for this to see the performance impact on
the streaming replication, just share it here for reference.

1) Sync primary-standby setup: load data on the primary and count the time spent on
  replication. The degradation becomes more obvious as the value of max_wal_senders
  increases.

max_wal_senders    before(ms)      after(ms)       degradation
100                13394.4013      14141.2615      5.58%
200                13280.8507      14597.1173      9.91%
300                13535.0232      16735.7379      23.65%

2) Similar to 1), but count the time that the standby startup process spent on
   replaying WAL (via gprof).

10 senders
===========
before
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
  4.12      0.45     0.11        1     0.11     2.46  PerformWalRecovery

after
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 17.99      0.59     0.59  4027383     0.00     0.00  WalSndWakeup
  8.23      0.86     0.27        1     0.27     3.11  PerformWalRecovery

100 senders
===========
before
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
  5.56      0.36     0.18        1     0.18     2.91  PerformWalRecovery

after
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 64.65      4.39     4.39  4027383     0.00     0.00  WalSndWakeup
  2.95      4.59     0.20        1     0.20     6.62  PerformWalRecovery

Will test again after applying the latest patch in this thread.

Best Regards,
Hou zj

Re: walsender performance regression due to logical decoding on standby changes

From
Bharath Rupireddy
Date:
On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> >> My current guess is that mis-using a condition variable is the best bet. I
> >> think it should work to use ConditionVariablePrepareToSleep() before a
> >> WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
> >> ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
> >> would still cause the necessary wakeup.
> >
> > How about something like the attached? Recovery and subscription tests
> > don't complain with the patch.
>
> I launched a full Cirrus CI test with it but it failed on one environment (did not look in details,
> just sharing this here): https://cirrus-ci.com/task/6570140767092736

Yeah, v1 had ConditionVariableInit() such that the CV was getting
initialized for every backend as opposed to just once after the WAL
sender shmem was created.

> Also I have a few comments:

Indeed, v1 was a WIP patch. Please have a look at the attached v2
patch, which has comments and passing CI runs on all platforms -
https://github.com/BRupireddy/postgres/tree/optimize_walsender_wakeup_logic_v2.

On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>         if (AllowCascadeReplication())
> -               WalSndWakeup(switchedTLI, true);
> +               ConditionVariableBroadcast(&WalSndCtl->cv);
>
> After the change, we wakeup physical walsender regardless of switchedTLI flag.
> Is this intentional ? if so, I think It would be better to update the comments above this.

That's not the case with the attached v2 patch. Please have a look.

On Thu, May 11, 2023 at 10:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> We can have two condition variables for
> logical and physical walsenders, and selectively wake up walsenders
> sleeping on the condition variables. It should work, it seems like
> much of a hack, though.

Andres rightly put it - 'mis-using' CV infrastructure. It is simple,
works, and makes WalSndWakeup() cheap, solving the performance
regression.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

RE: walsender performance regression due to logical decoding on standby changes

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, May 12, 2023 7:58 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> 
> On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > >> My current guess is that mis-using a condition variable is the best
> > >> bet. I think it should work to use
> > >> ConditionVariablePrepareToSleep() before a WalSndWait(), and then
> > >> ConditionVariableCancelSleep(). I.e. to never use
> > >> ConditionVariableSleep(). The latch set from
> ConditionVariableBroadcast() would still cause the necessary wakeup.
> > >
> > > How about something like the attached? Recovery and subscription
> > > tests don't complain with the patch.
> >
> > I launched a full Cirrus CI test with it but it failed on one
> > environment (did not look in details, just sharing this here):
> > https://cirrus-ci.com/task/6570140767092736
> 
> Yeah, v1 had ConditionVariableInit() such that the CV was getting initialized for
> every backend as opposed to just once after the WAL sender shmem was
> created.
> 
> > Also I have a few comments:
> 
> Indeed, v1 was a WIP patch. Please have a look at the attached v2 patch, which
> has comments and passing CI runs on all platforms -
> https://github.com/BRupireddy/postgres/tree/optimize_walsender_wakeup_
> logic_v2.
> 
> On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> >         if (AllowCascadeReplication())
> > -               WalSndWakeup(switchedTLI, true);
> > +               ConditionVariableBroadcast(&WalSndCtl->cv);
> >
> > After the change, we wakeup physical walsender regardless of switchedTLI
> flag.
> > Is this intentional ? if so, I think It would be better to update the comments
> above this.
> 
> That's not the case with the attached v2 patch. Please have a look.

Thanks for updating the patch. I did a simple primary->standby replication test for the
patch and can see that the replication degradation doesn't happen anymore after applying it [1].

One nitpick in the comment:

+     * walsenders. It makes WalSndWakeup() callers life easy.

callers life easy => callers' life easy.


[1]
max_wal_senders = 100
before regression(ms)                after regression(ms)    v2 patch(ms)
13394.4013                          14141.2615              13455.2543
Compared with before regression     5.58%                   0.45%

max_wal_senders = 200
before regression(ms)                after regression(ms)     v2 patch(ms)
13280.8507                          14597.1173              13632.0606
Compared with before regression     9.91%                   1.64%

max_wal_senders = 300
before regression(ms)                after regression(ms)     v2 patch(ms)
13535.0232                          16735.7379              13705.7135
Compared with before regression     23.65%                  1.26%

Best Regards,
Hou zj

Re: walsender performance regression due to logical decoding on standby changes

From
Thomas Munro
Date:
On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> Andres, rightly put it - 'mis-using' CV infrastructure. It is simple,
> works, and makes the WalSndWakeup() easy solving the performance
> regression.

Yeah, this seems OK, and better than the complicated alternatives.  If
one day we want to implement CVs some other way so that this
I-know-that-CVs-are-really-made-out-of-latches abstraction leak
becomes a problem, and we still need this, well then we can make a
separate latch-wait-list thing.



Re: walsender performance regression due to logical decoding on standby changes

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/15/23 4:19 AM, Zhijie Hou (Fujitsu) wrote:
> On Friday, May 12, 2023 7:58 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>>
>> On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand
>>
>> That's not the case with the attached v2 patch. Please have a look.
> 

Thanks for V2! It does look good to me and I like the fact that
WalSndWakeup() does not need to loop over all the walsender slots
anymore (for both the physical and logical cases).

> Thanks for updating the patch. I did some simple primary->standby replication test for the
> patch and can see the degradation doesn't happen in the replication after applying it[1].
> 

Thanks for the performance regression measurement!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: walsender performance regression due to logical decoding on standby changes

From
Masahiko Sawada
Date:
On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > Andres, rightly put it - 'mis-using' CV infrastructure. It is simple,
> > works, and makes the WalSndWakeup() easy solving the performance
> > regression.
>
> Yeah, this seems OK, and better than the complicated alternatives.  If
> one day we want to implement CVs some other way so that this
> I-know-that-CVs-are-really-made-out-of-latches abstraction leak
> becomes a problem, and we still need this, well then we can make a
> separate latch-wait-list thing.

+1

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: walsender performance regression due to logical decoding on standby changes

From
Bharath Rupireddy
Date:
On Mon, May 15, 2023 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > > Andres, rightly put it - 'mis-using' CV infrastructure. It is simple,
> > > works, and makes the WalSndWakeup() easy solving the performance
> > > regression.
> >
> > Yeah, this seems OK, and better than the complicated alternatives.  If
> > one day we want to implement CVs some other way so that this
> > I-know-that-CVs-are-really-made-out-of-latches abstraction leak
> > becomes a problem, and we still need this, well then we can make a
> > separate latch-wait-list thing.
>
> +1

Thanks for acknowledging the approach.

On Mon, May 15, 2023 at 2:11 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> Thanks for V2! It does look good to me and I like the fact that
> WalSndWakeup() does not need to loop on all the Walsenders slot
> anymore (for both the physical and logical cases).

Indeed, it doesn't have to.

On Mon, May 15, 2023 at 7:50 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Thanks for updating the patch. I did some simple primary->standby replication test for the
> patch and can see the degradation doesn't happen in the replication after applying it[1].
>
> One nitpick in the comment:
>
> +        * walsenders. It makes WalSndWakeup() callers life easy.
>
> callers life easy => callers' life easy.

Changed.

> [1]
> max_wal_senders = 100
> before regression(ms)                after regression(ms)    v2 patch(ms)
> 13394.4013                          14141.2615              13455.2543
> Compared with before regression     5.58%                   0.45%
>
> max_wal_senders = 200
> before regression(ms)                after regression(ms)     v2 patch(ms)
> 13280.8507                          14597.1173              13632.0606
> Compared with before regression     9.91%                   1.64%
>
> max_wal_senders = 300
> before regression(ms)                after regression(ms)     v2 patch(ms)
> 13535.0232                          16735.7379              13705.7135
> Compared with before regression     23.65%                  1.26%

Yes, the numbers with the v2 patch look close to where we were before.
Thanks for confirming. Just wondering, where is this extra
0.45%/1.64%/1.26% coming from?

Please find the attached v3 with the review comment addressed.

Do we need to add an open item for this issue in
https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items? If yes, can
anyone in this loop add one?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: walsender performance regression due to logical decoding on standby changes

From
"Drouvot, Bertrand"
Date:
Hi,

On 5/15/23 4:39 PM, Bharath Rupireddy wrote:
> On Mon, May 15, 2023 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 
> Do we need to add an open item for this issue in
> https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items? If yes, can
> anyone in this loop add one?

I do think we need one for this issue, so I have just added it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: walsender performance regression due to logical decoding on standby changes

From
Andres Freund
Date:
Hi,

On 2023-05-10 08:39:08 +0200, Drouvot, Bertrand wrote:
> On 5/9/23 11:00 PM, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:
> > > On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
> > > > I don't think the approach of not having any sort of "registry" of
> > > > whether
> > > > anybody is waiting for the replay position to be updated is
> > > > feasible. Iterating over all walsenders slots is just too expensive -
> > > 
> > > Would it work to use a shared counter for the waiters (or, two
> > > counters, one for physical and one for logical), and just early exit if
> > > the count is zero?
> > 
> > That doesn't really fix the problem - once you have a single walsender
> > connected, performance is bad again.
> > 
> 
> Just to clarify, do you mean that if there is only one remaining active walsender that, say,
> has been located at slot n, then we’d still have to loop from 0 to n in WalSndWakeup()?

I understood Jeff's proposal to just have an early exit if there are no
walsenders connected at all. But yes, even if we stopped iterating after
finding the number of slots we needed to, having to iterate over empty slots
would be an issue.

But TBH, even if we only did work for connected walsenders, I think this would
still be a performance issue. Acquiring O(#connected-walsenders) spinlocks for
every record is just too expensive.

Greetings,

Andres Freund



Re: walsender performance regression due to logical decoding on standby changes

From
Andres Freund
Date:
Hi,

On 2023-05-11 09:42:39 +0000, Zhijie Hou (Fujitsu) wrote:
> I did some simple tests for this to see the performance impact on
> the streaming replication, just share it here for reference.
> 
> 1) sync primary-standby setup, load data on primary and count the time spent on
>   replication. the degradation will be more obvious as the value of max_wal_senders
>   increases.

FWIW, using syncrep likely under-estimates the overhead substantially, because
that includes a lot of overhead on the WAL-generating side.  I saw well over 20%
overhead for the default max_wal_senders=10.

I just created a standby, shut it down, then ran a deterministically-sized
workload on the primary, started the standby, and measured how long it took to
catch up.  I just used the log messages to measure the time.


> 2) Similar as 1) but count the time that the standby startup process spent on
>    replaying WAL(via gprof).

I don't think that's the case here, but IME gprof's overhead is so high that
it can move bottlenecks quite drastically. The problem is that it adds code to
every function entry/exit - for simple functions, that overhead is much higher
than the "original" cost of the function.

gprof-style instrumentation is good for things like code coverage, but for
performance evaluation it's normally better to use a sampling profiler like
perf. That also causes slowdowns, but largely only in places that already take
up substantial execution time.

Greetings,

Andres Freund



Re: walsender performance regression due to logical decoding on standby changes

From
Jeff Davis
Date:
On Wed, 2023-05-17 at 12:34 -0700, Andres Freund wrote:
> I understood Jeff's proposal to just have an early exit if there are
> no
> walsenders connected at all.

My suggestion was that we early exit unless there is at least one *waiting*
walsender of the appropriate type. In other words, increment the
counter on entry to WalSndWait() and decrement on exit. I didn't look
in detail yet, so perhaps it's not easy/cheap to do that safely.

Regards,
    Jeff Davis




Re: walsender performance regression due to logical decoding on standby changes

From
Andres Freund
Date:
Hi,

Thanks for working on the patch!

On 2023-05-15 20:09:00 +0530, Bharath Rupireddy wrote:
> > [1]
> > max_wal_senders = 100
> > before regression(ms)                after regression(ms)    v2 patch(ms)
> > 13394.4013                          14141.2615              13455.2543
> > Compared with before regression     5.58%                   0.45%
> >
> > max_wal_senders = 200
> > before regression(ms)                after regression(ms)     v2 patch(ms)
> > 13280.8507                          14597.1173              13632.0606
> > Compared with before regression     9.91%                   1.64%
> >
> > max_wal_senders = 300
> > before regression(ms)                after regression(ms)     v2 patch(ms)
> > 13535.0232                          16735.7379              13705.7135
> > Compared with before regression     23.65%                  1.26%
> 
> Yes, the numbers with v2 patch look close to where we were before.
> Thanks for confirming. Just wondering, where does this extra
> 0.45%/1.64%/1.26% coming from?

We still do more work for each WAL record than before, so I'd expect something
small. I'd say right now the main overhead with the patch comes from the
spinlock acquisitions in ConditionVariableBroadcast(), which happen even when
nobody is waiting.

I'll try to come up with a benchmark without the issues I pointed out in
https://postgr.es/m/20230517194331.ficfy5brpfq5lrmz%40awork3.anarazel.de


> +        ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
> +        ConditionVariableInit(&WalSndCtl->logicalWALSndCV);

It's not obvious to me that it's worth having two CVs, because it's more
expensive to find no waiters in two CVs than to find no waiters in one CV.


> +    /*
> +     * We use condition variable (CV) to efficiently wake up walsenders in
> +     * WalSndWakeup().
> +     *
> +     * Every walsender prepares to sleep on a shared memory CV. Note that it
> +     * just prepares to sleep on the CV (i.e., adds itself to the CV's
> +     * waitlist), but not actually waits on the CV (IOW, it never calls
> +     * ConditionVariableSleep()). It still uses WaitEventSetWait() for waiting,
> +     * because CV infrastructure doesn't handle FeBe socket events currently.
> +     * The processes (startup process, walreceiver etc.) wanting to wake up
> +     * walsenders use ConditionVariableBroadcast(), which in turn calls
> +     * SetLatch(), helping walsenders come out of WaitEventSetWait().
> +     *
> +     * This approach is simple and efficient because, one doesn't have to loop
> +     * through all the walsenders slots, with a spinlock acquisition and
> +     * release for every iteration, just to wake up only the waiting
> +     * walsenders. It makes WalSndWakeup() callers' life easy.
> +     *
> +     * XXX: When available, WaitEventSetWait() can be replaced with its CV's
> +     * counterpart.

I don't really understand that XXX - the potential bright future would be to
add support for CVs into wait event sets, not to replace WES with a CV?

Greetings,

Andres Freund



Re: walsender performance regression due to logical decoding on standby changes

From
Andres Freund
Date:
Hi,

On 2023-05-17 12:53:15 -0700, Andres Freund wrote:
> I'll try to come up with a benchmark without the issues I pointed out in
> https://postgr.es/m/20230517194331.ficfy5brpfq5lrmz%40awork3.anarazel.de

Here we go:

setup:

create primary
SELECT pg_create_physical_replication_slot('reserve', true);
create standby using pg_basebackup

create WAL:
psql -c 'CREATE TABLE testtable_logged(other_data int default 1);' && \
  c=16; PGOPTIONS='-c synchronous_commit=off' /path-to-pgbench --random-seed=0 -n -c$c -j$c -t1000 -P1 \
    -f <( echo "INSERT INTO testtable_logged SELECT generate_series(1, 1000)" ) && \
  psql -c "SELECT pg_create_restore_point('end');"

benchmark:
rm -rf /tmp/test && \
  cp -ar /srv/dev/pgdev-dev-standby /tmp/test && \
  cp -ar /srv/dev/pgdev-dev/pg_wal/* /tmp/test/pg_wal/ && \
  sync && \
  /usr/bin/time -f '%es' /path-to-postgres -D /tmp/test \
    -c recovery_target_action=shutdown -c recovery_target_name=end \
    -c shared_buffers=1GB -c fsync=off

That way I can measure how long it takes to replay exactly the same WAL, and
also take profiles of exactly the same work, without influencing time results.

I copy the WAL files to the primary to ensure that walreceiver (standby) /
walsender (primary) performance doesn't make the result variability higher.


                      max_walsenders=10     max_walsenders=100
e101dfac3a5 reverted  7.01s                 7.02s
093e5c57d50 / HEAD    8.25s                 19.91s
bharat-v3             7.14s                 7.13s

So indeed, bharat-v3 largely fixes the issue.

The regression of v3 compared to e101dfac3a5 reverted seems pretty constant at
~0.982x, independent of the concrete max_walsenders value. Which makes sense,
the work is constant.


To make it more extreme, I also tested a workload that is basically free to replay:

c=16; /srv/dev/build/m-opt/src/bin/pgbench/pgbench --random-seed=0 -n -c$c -j$c -t5000 -P1 \
  -f <( echo "SELECT pg_logical_emit_message(false, 'c', 'a') FROM generate_series(1, 1000)" ) && \
  psql -c "SELECT pg_create_restore_point('end');"


                      max_walsenders=10     max_walsenders=100
e101dfac3a5 reverted  1.70s                 1.70s
093e5c57d50 / HEAD    3.00s                 14.56s
bharat-v3             1.88s                 1.88s

In this extreme workload we still regress by ~0.904x.

I'm not sure how much it's worth worrying about that - this is a quite
unrealistic testcase.

FWIW, if I just make WalSndWakeup() do nothing, I still see a very small, but
reproducible, overhead: 1.72s - that's just the cost of the additional
external function call.

If I add a no-waiters fastpath using proclist_is_empty() to
ConditionVariableBroadcast(), I get 1.77s. So the majority of the remaining
slowdown indeed comes from the spinlock acquisition in
ConditionVariableBroadcast().
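
That experiment amounted to roughly this (a benchmarking hack only - reading
the wait list without the spinlock isn't something to commit as-is):

void
ConditionVariableBroadcast(ConditionVariable *cv)
{
	/* Hack: unlocked peek at the wait list, bail out early if it's empty. */
	if (proclist_is_empty(&cv->wakeup))
		return;

	/* ... existing implementation: take cv->mutex, wake up each waiter ... */
}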

Greetings,

Andres Freund



Re: walsender performance regression due to logical decoding on standby changes

From
Bharath Rupireddy
Date:
On Thu, May 18, 2023 at 1:23 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-05-15 20:09:00 +0530, Bharath Rupireddy wrote:
> > > [1]
> > > max_wal_senders = 100
> > > before regression(ms)                after regression(ms)    v2 patch(ms)
> > > 13394.4013                          14141.2615              13455.2543
> > > Compared with before regression     5.58%                   0.45%
> > >
> > > max_wal_senders = 200
> > > before regression(ms)                after regression(ms)     v2 patch(ms)
> > > 13280.8507                          14597.1173              13632.0606
> > > Compared with before regression     9.91%                   1.64%
> > >
> > > max_wal_senders = 300
> > > before regression(ms)                after regression(ms)     v2 patch(ms)
> > > 13535.0232                          16735.7379              13705.7135
> > > Compared with before regression     23.65%                  1.26%
> >
> > Yes, the numbers with v2 patch look close to where we were before.
> > Thanks for confirming. Just wondering, where does this extra
> > 0.45%/1.64%/1.26% coming from?
>
> We still do more work for each WAL record than before, so I'd expect something
> small. I'd say right now the main overhead with the patch comes from the
> spinlock acquisitions in ConditionVariableBroadcast(), which happen even when
> nobody is waiting.


> > +             ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
> > +             ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
>
> It's not obvious to me that it's worth having two CVs, because it's more
> expensive to find no waiters in two CVs than to find no waiters in one CV.

I disagree. In the tight per-WAL-record recovery loop, WalSndWakeup()
wakes up logical walsenders for every WAL record, but it wakes up
physical walsenders only if the applied WAL record causes a TLI
switch. Therefore, the extra cost of a spinlock acquire-release per
WAL record applies only to logical walsenders. On the other hand, if
we were to use a single CV, we would be unnecessarily waking up (if at
all they are sleeping) physical walsenders for every WAL record -
which is costly IMO.

I still think separate CVs are good for selective wake-ups, given that
TLI-switch WAL records are relatively rare.

> > +      *
> > +      * XXX: When available, WaitEventSetWait() can be replaced with its CV's
> > +      * counterpart.
>
> I don't really understand that XXX - the potential bright future would be to
> add support for CVs into wait event sets, not to replace WES with a CV?

Yes, that's what I meant; I've modified that part to:

     * XXX: A desirable future improvement would be to add support for CVs into
     * WaitEventSetWait().

> FWIW, if I just make WalSndWakeup() do nothing, I still see a very small, but
> reproducible, overhead: 1.72s - that's just the cost of the additional
> external function call.

Hm, this is unavoidable.

> If I add a no-waiters fastpath using proclist_is_empty() to
> ConditionVariableBroadcast(), I get 1.77s. So the majority of the remaining
> slowdown indeed comes from the spinlock acquisition in
> ConditionVariableBroadcast().

Acquiring a spinlock for every replayed WAL record seems not great. In
general, this problem exists for the CV infrastructure as a whole whenever
ConditionVariableBroadcast()/CVB() is called in a loop/hot path. I
think this can be shown easily by doing something like - select
pg_replication_origin_session_setup()/pg_wal_replay_resume()/pg_create_physical_replication_slot()
from generate_series(1, 1000000...);, all of these functions end up in
CVB().

Do you think adding a fastpath exit to CVB() when there are no waiters is
worth implementing? Something like an atomic state variable in each CV,
similar to LWLock's state variable: set it when adding waiters, reset it
when removing waiters, and have CVB() read the state atomically for the
fastpath (of course, we need memory barriers here). This might
complicate things, as each CV structure gets an extra state variable (4
bytes), memory barriers, and atomic reads and writes.

Or, given that with the patch recovery is the only caller of CVB() in a
hot path, maybe we can add an atomic waiter count to WalSndCtl:
increment it atomically in WalSndWait() before the wait, decrement it
after the wait, and add a fastpath in WalSndWakeup() that reads it
atomically to avoid the CVB() calls. IIUC, this is something Jeff
proposed upthread.
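
Roughly like this (num_waiting_walsenders is an invented pg_atomic_uint32
field; memory ordering is hand-waved here):

/* In WalSndWait(), around the existing WaitEventSetWait() call: */
pg_atomic_fetch_add_u32(&WalSndCtl->num_waiting_walsenders, 1);
/* ... WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) ... */
pg_atomic_fetch_sub_u32(&WalSndCtl->num_waiting_walsenders, 1);

/* In WalSndWakeup(): skip the broadcasts entirely when nobody is waiting. */
if (pg_atomic_read_u32(&WalSndCtl->num_waiting_walsenders) == 0)
	return;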

Thoughts?

Please find the attached v4 patch addressing the review comment (not
the fastpath one).

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: walsender performance regression due to logical decoding on standby changes

From
Kyotaro Horiguchi
Date:
At Thu, 18 May 2023 20:11:11 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> > > +             ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
> > > +             ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
> >
> > It's not obvious to me that it's worth having two CVs, because it's more
> > expensive to find no waiters in two CVs than to find no waiters in one CV.
> 
> I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
> wakes up logical walsenders for every WAL record, but it wakes up
> physical walsenders only if the applied WAL record causes a TLI
> switch. Therefore, the extra cost of spinlock acquire-release for per
> WAL record applies only for logical walsenders. On the other hand, if
> we were to use a single CV, we would be unnecessarily waking up (if at
> all they are sleeping) physical walsenders for every WAL record -
> which is costly IMO.

As I was reading this, I started thinking that one reason for the
regression could be an excessive frequency of wakeups during logical
replication. In physical replication, we make sure to avoid excessive
wakeups when the stream is tightly packed.  I'm just wondering why
logical replication doesn't (or can't) do the same thing, IMHO.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



 



RE: walsender performance regression due to logical decoding on standby changes

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, May 19, 2023 11:08 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Thu, 18 May 2023 20:11:11 +0530, Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote in
> > > > +             ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
> > > > +             ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
> > >
> > > It's not obvious to me that it's worth having two CVs, because it's more
> > > expensive to find no waiters in two CVs than to find no waiters in one CV.
> >
> > I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
> > wakes up logical walsenders for every WAL record, but it wakes up
> > physical walsenders only if the applied WAL record causes a TLI
> > switch. Therefore, the extra cost of spinlock acquire-release for per
> > WAL record applies only for logical walsenders. On the other hand, if
> > we were to use a single CV, we would be unnecessarily waking up (if at
> > all they are sleeping) physical walsenders for every WAL record -
> > which is costly IMO.
>
> As I was reading this, I start thinking that one reason for the
> regression could be to exccessive frequency of wakeups during logical
> replication. In physical replication, we make sure to avoid exccessive
> wakeups when the stream is tightly packed.  I'm just wondering why
> logical replication doesn't (or can't) do the same thing, IMHO.

I thought (from the e101dfa commit message) that physical walsenders can't send
data until it's been flushed, so they only get woken up at the time of a flush
or TLI switch. A logical walsender can start to decode changes after a change
is applied, and to avoid the case where the walsender is asleep while there's
work to be done, we wake up logical walsenders when applying each record.

Or maybe you mean to wake up (ConditionVariableBroadcast()) walsenders only
after applying some number of WAL records, like [1]? But it seems that may
delay the wakeup of the walsender a bit (the walsender may stay asleep before
reaching the threshold).




[1]
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1833,6 +1833,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 {
     ErrorContextCallback errcallback;
     bool        switchedTLI = false;
+    static int    nreplay = 0;

     /* Setup error traceback support for ereport() */
     errcallback.callback = rm_redo_error_callback;
@@ -1957,8 +1958,12 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
      *    be created otherwise)
      * ------
      */
-    if (AllowCascadeReplication())
+    if (AllowCascadeReplication() &&
+        nreplay++ == 100)
+    {
         WalSndWakeup(switchedTLI, true);
+        nreplay = 0;
+    }

Best Regards,
Hou zj



Re: walsender performance regression due to logical decoding on standby changes

From
Andres Freund
Date:
Hi,

On 2023-05-19 12:07:56 +0900, Kyotaro Horiguchi wrote:
> At Thu, 18 May 2023 20:11:11 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> > > > +             ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
> > > > +             ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
> > >
> > > It's not obvious to me that it's worth having two CVs, because it's more
> > > expensive to find no waiters in two CVs than to find no waiters in one CV.
> > 
> > I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
> > wakes up logical walsenders for every WAL record, but it wakes up
> > physical walsenders only if the applied WAL record causes a TLI
> > switch. Therefore, the extra cost of spinlock acquire-release for per
> > WAL record applies only for logical walsenders. On the other hand, if
> > we were to use a single CV, we would be unnecessarily waking up (if at
> > all they are sleeping) physical walsenders for every WAL record -
> > which is costly IMO.
> 
> As I was reading this, I start thinking that one reason for the
> regression could be to exccessive frequency of wakeups during logical
> replication. In physical replication, we make sure to avoid exccessive
> wakeups when the stream is tightly packed.  I'm just wondering why
> logical replication doesn't (or can't) do the same thing, IMHO.

It's possible we could try to reduce the frequency by issuing wakeups only at
specific points. The most obvious thing to do would be to wake only when
waiting for more WAL or when crossing a page boundary, or such. Unfortunately
that could easily lead to deadlocks, because the startup process might be
blocked waiting for a lock, held by a backend doing logical decoding - which
can't progress until the startup process wakes the backend up.

So I don't think this is a promising avenue in the near term.

Greetings,

Andres Freund



Re: walsender performance regression due to logical decoding on standby changes

От
Andres Freund
Дата:
Hi,

On 2023-05-18 20:11:11 +0530, Bharath Rupireddy wrote:
> Please find the attached v4 patch addressing the review comment (not
> the fastpath one).

I pushed a mildly edited version of this. I didn't like the names of the CVs
much, so I renamed them to wal_flush_cv/wal_replay_cv. I also did some minor
comment polishing.
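
A rough sketch of the shape this takes, using the CV names above (simplified
here, not the exact committed code):

    void
    WalSndWakeup(bool physical, bool logical)
    {
        /*
         * Rather than iterating over every walsender slot and acquiring each
         * slot's spinlock, broadcast on a condition variable.  Each walsender
         * registers itself on the CV matching its kind before it sleeps, so
         * finding no waiters here is cheap.
         */
        if (physical)
            ConditionVariableBroadcast(&WalSndCtl->wal_flush_cv);

        if (logical)
            ConditionVariableBroadcast(&WalSndCtl->wal_replay_cv);
    }

That way the startup process's per-record call only touches the replay CV
(except on a TLI switch), and walsenders of the other kind aren't disturbed
at all.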

Greetings,

Andres Freund



RE: walsender performance regression due to logical decoding on standby changes

От
"Zhijie Hou (Fujitsu)"
Дата:
Hi,

On Monday, May 22, 2023 12:11 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-05-19 12:07:56 +0900, Kyotaro Horiguchi wrote:
> > At Thu, 18 May 2023 20:11:11 +0530, Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote in
> > > > > +
> ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
> > > > > +             ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
> > > >
> > > > It's not obvious to me that it's worth having two CVs, because it's more
> > > > expensive to find no waiters in two CVs than to find no waiters in one CV.
> > >
> > > I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
> > > wakes up logical walsenders for every WAL record, but it wakes up
> > > physical walsenders only if the applied WAL record causes a TLI
> > > switch. Therefore, the extra cost of spinlock acquire-release for per
> > > WAL record applies only for logical walsenders. On the other hand, if
> > > we were to use a single CV, we would be unnecessarily waking up (if at
> > > all they are sleeping) physical walsenders for every WAL record -
> > > which is costly IMO.
> >
> > As I was reading this, I start thinking that one reason for the
> > regression could be to exccessive frequency of wakeups during logical
> > replication. In physical replication, we make sure to avoid exccessive
> > wakeups when the stream is tightly packed.  I'm just wondering why
> > logical replication doesn't (or can't) do the same thing, IMHO.
>
> It's possible we could try to reduce the frequency by issuing wakeups only at
> specific points. The most obvious thing to do would be to wake only when
> waiting for more WAL or when crossing a page boundary, or such.
> Unfortunately
> that could easily lead to deadlocks, because the startup process might be
> blocked waiting for a lock, held by a backend doing logical decoding - which
> can't progress until the startup process wakes the backend up.

Just out of curiosity about the deadlock scenario you mentioned; I have no
comment on the committed patch itself.

About "a backend doing logical decoding", do you mean the case when a user
start a backend and invoke pg_logical_slot_get_changes() to do the logical
decoding ? If so, it seems the logical decoding in a backend won't be waked up
by startup process because the backend won't be registered as a walsender so
the backend won't be found in WalSndWakeup().

Or do you mean a deadlock between a real logical walsender and the startup
process? (I might be missing something.) I think logical decoding doesn't
lock the target user relation while decoding, because it can normally get the
needed information from the WAL. Besides, the walsender sometimes locks a
system table (e.g. it uses RelidByRelfilenumber() to get the relid), but it
unlocks it after finishing the systable scan.

So, if possible, would you be able to share some details about the deadlock
case you mentioned earlier? That would be helpful, as it could prevent
similar problems in future development.

Best Regards,
Hou zj



Re: walsender performance regression due to logical decoding on standby changes

От
Andres Freund
Дата:
Hi,

On 2023-05-22 12:15:07 +0000, Zhijie Hou (Fujitsu) wrote:
> About "a backend doing logical decoding", do you mean the case when a user
> start a backend and invoke pg_logical_slot_get_changes() to do the logical
> decoding ? If so, it seems the logical decoding in a backend won't be waked up
> by startup process because the backend won't be registered as a walsender so
> the backend won't be found in WalSndWakeup().

I meant logical decoding happening inside a walsender instance.


> Or do you mean the deadlock between the real logical walsender and startup
> process ? (I might miss something) I think the logical decoding doesn't lock
> the target user relation when decoding because it normally can get the needed
> information from WAL.

It does lock catalog tables briefly. There's no guarantee that such locks are
released immediately. I forgot the details, but IIRC there's some outfuncs
(enum?) that intentionally delay releasing locks till transaction commit.

Greetings,

Andres Freund



RE: walsender performance regression due to logical decoding on standby changes

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tuesday, May 23, 2023 1:53 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-05-22 12:15:07 +0000, Zhijie Hou (Fujitsu) wrote:
> > About "a backend doing logical decoding", do you mean the case when a
> user
> > start a backend and invoke pg_logical_slot_get_changes() to do the logical
> > decoding ? If so, it seems the logical decoding in a backend won't be waked
> up
> > by startup process because the backend won't be registered as a walsender
> so
> > the backend won't be found in WalSndWakeup().
>
> I meant logical decoding happening inside a walsender instance.
>
>
> > Or do you mean the deadlock between the real logical walsender and startup
> > process ? (I might miss something) I think the logical decoding doesn't lock
> > the target user relation when decoding because it normally can get the
> needed
> > information from WAL.
>
> It does lock catalog tables briefly. There's no guarantee that such locks are
> released immediately. I forgot the details, but IIRC there's some outfuncs
> (enum?) that intentionally delay releasing locks till transaction commit.

Thanks for the explanation!

I understand that the startup process can take a lock on a catalog (when
replaying a record), which may conflict with a lock taken in the walsender.

But in the walsender, I think we only start a transaction after entering
ReorderBufferProcessTXN(), and the transaction started there is ended soon
after processing and outputting the decoded transaction's data (as the
comment in ReorderBufferProcessTXN() says: "all locks acquired in here to be
released, not reassigned to the parent and we do not want any database access
have persistent effects.").
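
(A simplified sketch of the transaction scope in ReorderBufferProcessTXN()
that I'm referring to; the real code in reorderbuffer.c has much more around
it, so treat this only as an illustration of where the locks get released:)

    if (using_subtxn)
        BeginInternalSubTransaction("replay");
    else
        StartTransactionCommand();

    /* ... decode the changes and hand them to the output plugin ... */

    /* locks taken for catalog access above are released here */
    if (using_subtxn)
        RollbackAndReleaseCurrentSubTransaction();
    else
        AbortCurrentTransaction();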

Besides, while processing and outputting the decoded transaction, the
walsender doesn't wait to be woken up by the startup process (e.g. in
WalSndWaitForWal()); it only waits while the data is being sent to the
subscriber. So it seems a lock conflict here won't cause a deadlock for now,
although it may become a risk if we change this logic later. Sorry if I
missed something, and thanks again for your patience in explaining.

Best Regards,
Hou zj





Re: walsender performance regression due to logical decoding on standby changes

От
Andres Freund
Дата:
Hi,

On 2023-05-24 05:53:51 +0000, Zhijie Hou (Fujitsu) wrote:
> On Tuesday, May 23, 2023 1:53 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-05-22 12:15:07 +0000, Zhijie Hou (Fujitsu) wrote:
> > > About "a backend doing logical decoding", do you mean the case when a
> > user
> > > start a backend and invoke pg_logical_slot_get_changes() to do the logical
> > > decoding ? If so, it seems the logical decoding in a backend won't be waked
> > up
> > > by startup process because the backend won't be registered as a walsender
> > so
> > > the backend won't be found in WalSndWakeup().
> > 
> > I meant logical decoding happening inside a walsender instance.
> > 
> > 
> > > Or do you mean the deadlock between the real logical walsender and startup
> > > process ? (I might miss something) I think the logical decoding doesn't lock
> > > the target user relation when decoding because it normally can get the
> > needed
> > > information from WAL.
> > 
> > It does lock catalog tables briefly. There's no guarantee that such locks are
> > released immediately. I forgot the details, but IIRC there's some outfuncs
> > (enum?) that intentionally delay releasing locks till transaction commit.
> 
> Thanks for the explanation !
> 
> I understand that the startup process can take lock on the catalog(when
> replaying record) which may conflict with the lock in walsender.
> 
> But in walsender, I think we only start transaction after entering
> ReorderBufferProcessTXN(), and the transaction started here will be released
> soon after processing and outputting the decoded transaction's data(as the
> comment in ReorderBufferProcessTXN() says:" all locks acquired in here to be
> released, not reassigned to the parent and we do not want any database access
> have persistent effects.").

It's possible that there's no immediate danger - but I wouldn't want to bet on
it staying that way. Think about things like streaming out large transactions
etc.  There also are potential dangers outside of plain things like locks -
there could be a recovery conflict on the snapshot or such (although that
normally should be prevented via hot_standby_feedback or such).


Even if there's no hard deadlock issue: not waking up a walsender doing
logical decoding that's currently waiting, just because
(replay_count % N) != 0, means that the walsender might be delayed for quite
a while - there might not be any further records from the primary in the
near future.


I don't think any approach purely based on record counts has any chance to
work well.  You could combine it with something else, e.g. with always waking
up in XLogPageRead() and WaitForWALToBecomeAvailable().  But then you end up
with something relatively complicated again.
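
(Purely as an illustration of that kind of combination, and not something
proposed here: the counter-based throttle from [1] upthread would have to be
paired with an unconditional wakeup wherever the startup process is about to
wait. The placement and the threshold below are assumptions.)

    /* hypothetical sketch only: throttled wakeup in ApplyWalRecord() */
    if (AllowCascadeReplication() &&
        (switchedTLI || ++nreplay % 100 == 0))
        WalSndWakeup(switchedTLI, true);

    /*
     * ... combined with an unconditional wakeup before the startup process
     * waits for more WAL (e.g. in WaitForWALToBecomeAvailable()), so that a
     * sleeping logical walsender isn't left behind when no further records
     * arrive from the primary.
     */
    if (AllowCascadeReplication())
        WalSndWakeup(false, true);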

Greetings,

Andres Freund