Thread: walsender performance regression due to logical decoding on standby changes
Hi,

Unfortunately I have found the following commit to have caused a performance regression:

commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
Author: Andres Freund <andres@anarazel.de>
Date:   2023-04-08 00:24:24 -0700

    For cascading replication, wake physical and logical walsenders separately

    Physical walsenders can't send data until it's been flushed; logical
    walsenders can't decode and send data until it's been applied. On the
    standby, the WAL is flushed first, which will only wake up physical
    walsenders; and then applied, which will only wake up logical walsenders.

    Previously, all walsenders were awakened when the WAL was flushed. That
    was fine for logical walsenders on the primary; but on the standby the
    flushed WAL would have been not applied yet, so logical walsenders were
    awakened too early.

    Per idea from Jeff Davis and Amit Kapila.

    Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
    Reviewed-By: Jeff Davis <pgsql@j-davis.com>
    Reviewed-By: Robert Haas <robertmhaas@gmail.com>
    Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
    Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
    Discussion: https://postgr.es/m/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com

The problem is that, on a standby, after the change - as needed for the approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for every record, instead of only when the timeline is changed (or WAL is flushed, or ...).

WalSndWakeup() iterates over all walsender slots, regardless of whether they are in use. For each of the walsender slots it acquires a spinlock.

When replaying a lot of small-ish WAL records I found the startup process to spend the majority of the time in WalSndWakeup(). I've not measured it very precisely yet, but the overhead is significant (~35% slowdown), even with the default max_wal_senders. If that's increased substantially, it obviously gets worse.

The only saving grace is that this is not an issue on the primary.
I unfortunately spent less time on this commit of the logical-decoding-on-standby series than on the others. There were so many other senior contributors discussing it that I "relaxed" a bit too much.

I don't think the approach of not having any sort of "registry" of whether anybody is waiting for the replay position to be updated is feasible. Iterating over all walsender slots is just too expensive - WalSndWakeup() shows up even if I remove the spinlock (which we likely could, especially when just checking whether the walsender is connected).

My current guess is that mis-using a condition variable is the best bet. I think it should work to use ConditionVariablePrepareToSleep() before a WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use ConditionVariableSleep(). The latch set from ConditionVariableBroadcast() would still cause the necessary wakeup.

Greetings,

Andres Freund
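[Editor's illustration] The prepare/cancel pattern Andres sketches can be modeled with a toy, single-threaded stand-in (all names invented here; PostgreSQL's real ConditionVariable lives in src/backend/storage/lmgr/condition_variable.c and links PGPROC entries into a proclist, setting each waiter's latch on broadcast). The property that matters: broadcast touches only the registered waiters, not all max_wal_senders slots.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a latch-based condition variable (hypothetical names). */
#define MAX_PROCS 8

bool latch_set[MAX_PROCS];   /* stand-in for each process's latch */
int  wait_list[MAX_PROCS];   /* procs currently registered on the CV */
int  n_waiting;

/* Like ConditionVariablePrepareToSleep(): register before waiting. */
void cv_prepare_to_sleep(int proc)
{
    wait_list[n_waiting++] = proc;
}

/* Like ConditionVariableCancelSleep(): deregister after the wait returns. */
void cv_cancel_sleep(int proc)
{
    for (int i = 0; i < n_waiting; i++)
    {
        if (wait_list[i] == proc)
        {
            wait_list[i] = wait_list[--n_waiting];
            return;
        }
    }
}

/* Like ConditionVariableBroadcast(): wake only registered waiters, O(waiters). */
void cv_broadcast(void)
{
    for (int i = 0; i < n_waiting; i++)
        latch_set[wait_list[i]] = true;
}
```

In the walsender case, WalSndWait() would call the prepare step before waiting and the cancel step after it returns, never the sleep step; the latch set by broadcast is what wakes the waiting walsender.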
On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
> I don't think the approach of not having any sort of "registry" of
> whether anybody is waiting for the replay position to be updated is
> feasible. Iterating over all walsenders slots is just too expensive -

Would it work to use a shared counter for the waiters (or, two counters, one for physical and one for logical), and just early exit if the count is zero?

Regards,
Jeff Davis
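[Editor's illustration] A minimal single-threaded sketch of Jeff's counter idea (all names hypothetical, not actual PostgreSQL code): the replay loop skips the slot scan entirely while the counter is zero - though, as Andres notes in his reply, once a single walsender is waiting it still scans every slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_WAL_SENDERS 10

atomic_int num_physical_waiters;     /* walsenders waiting for flush */
atomic_int num_logical_waiters;      /* walsenders waiting for apply */
bool slot_in_use[MAX_WAL_SENDERS];
int  latches_set;                    /* stand-in for SetLatch() calls */

void walsender_start_wait(bool logical)
{
    atomic_fetch_add(logical ? &num_logical_waiters
                             : &num_physical_waiters, 1);
}

void walsender_end_wait(bool logical)
{
    atomic_fetch_sub(logical ? &num_logical_waiters
                             : &num_physical_waiters, 1);
}

/* Early-exit WalSndWakeup(): cheap when nobody waits, but still an
 * O(MAX_WAL_SENDERS) scan as soon as one walsender is waiting. */
void wal_snd_wakeup(void)
{
    if (atomic_load(&num_physical_waiters) == 0 &&
        atomic_load(&num_logical_waiters) == 0)
        return;

    for (int i = 0; i < MAX_WAL_SENDERS; i++)
        if (slot_in_use[i])
            latches_set++;
}
```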
Re: walsender performance regression due to logical decoding on standby changes
From: Andres Freund
Hi,

On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:
> On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
> > I don't think the approach of not having any sort of "registry" of
> > whether anybody is waiting for the replay position to be updated is
> > feasible. Iterating over all walsenders slots is just too expensive -
>
> Would it work to use a shared counter for the waiters (or, two
> counters, one for physical and one for logical), and just early exit if
> the count is zero?

That doesn't really fix the problem - once you have a single walsender connected, performance is bad again.

Greetings,

Andres Freund
Re: walsender performance regression due to logical decoding on standby changes
From: Bharath Rupireddy
On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:
>
> Unfortunately I have found the following commit to have caused a performance
> regression:
>
> commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
>
> The problem is that, on a standby, after the change - as needed to for the
> approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
> every record, instead of only happening when the timeline is changed (or WAL
> is flushed or ...).
>
> WalSndWakeup() iterates over all walsender slots, regardless of whether in
> use. For each of the walsender slots it acquires a spinlock.
>
> When replaying a lot of small-ish WAL records I found the startup process to
> spend the majority of the time in WalSndWakeup(). I've not measured it very
> precisely yet, but the overhead is significant (~35% slowdown), even with the
> default max_wal_senders. If that's increased substantially, it obviously gets
> worse.

I played with a simple primary -> standby1 -> standby2 setup. I ran a pgbench script [1] on the primary and counted the number of times WalSndWakeup() gets called from ApplyWalRecord() and the number of times the spinlock is acquired/released in WalSndWakeup(). The spinlock is acquired/released a whopping 21 million times on standby 1 and standby 2 for just a < 5 min pgbench run on the primary:

standby 1:
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO WalSndWakeup() in ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO spinlock acquisition/release count in WalSndWakeup() is 21763530

standby 2:
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO WalSndWakeup() in ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO spinlock acquisition/release count in WalSndWakeup() is 21763530

In this case, there is no timeline switch and no logical decoding on the standby, but WalSndWakeup() still gets called because the standby can't tell whether a slot is for logical or physical replication without acquiring the spinlock. Before e101dfac3a, WalSndWakeup() was called only when there was a timeline switch.

> The only saving grace is that this is not an issue on the primary.

Yeah.

> I don't think the approach of not having any sort of "registry" of whether
> anybody is waiting for the replay position to be updated is
> feasible. Iterating over all walsenders slots is just too expensive -
> WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
> especially when just checking if the the walsender is connected).

Right.

> My current guess is that mis-using a condition variable is the best bet. I
> think it should work to use ConditionVariablePrepareToSleep() before a
> WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
> ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
> would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription tests don't complain with the patch.
[1]
./pg_ctl -D data -l logfile stop
./pg_ctl -D sbdata1 -l logfile1 stop
./pg_ctl -D sbdata2 -l logfile2 stop
cd /home/ubuntu/postgres
rm -rf inst
make distclean
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j 8 install > install.log 2>&1 &

primary -> standby1 -> standby2

free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m
cd inst/bin
./initdb -D data
mkdir archived_wal
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/postgres/inst/bin/archived_wal/%f'
EOF
./pg_ctl -D data -l logfile start
./pg_basebackup -D sbdata1
./psql -c "select pg_create_physical_replication_slot('sb1_slot', true, false)" postgres
cat << EOF >> sbdata1/postgresql.conf
port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu application_name=sb1'
primary_slot_name='sb1_slot'
restore_command='cp /home/ubuntu/postgres/inst/bin/archived_wal/%f %p'
EOF
touch sbdata1/standby.signal
./pg_ctl -D sbdata1 -l logfile1 start
./pg_basebackup -D sbdata2 -p 5433
./psql -p 5433 -c "select pg_create_physical_replication_slot('sb2_slot', true, false)" postgres
cat << EOF >> sbdata2/postgresql.conf
port=5434
primary_conninfo='host=localhost port=5433 dbname=postgres user=ubuntu application_name=sb2'
primary_slot_name='sb2_slot'
restore_command='cp /home/ubuntu/postgres/inst/bin/archived_wal/%f %p'
EOF
touch sbdata2/standby.signal
./pg_ctl -D sbdata2 -l logfile2 start
./psql -c "select pid, backend_type from pg_stat_activity where backend_type in ('walreceiver', 'walsender');" postgres
./psql -p 5433 -c "select pid, backend_type from pg_stat_activity where backend_type in ('walreceiver', 'walsender');" postgres
./psql -p 5434 -c "select pid, backend_type from pg_stat_activity where backend_type in ('walreceiver', 'walsender');" postgres
ulimit -S -n 5000
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c ";./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments
Re: walsender performance regression due to logical decoding on standby changes
From: "Drouvot, Bertrand"
Hi, On 5/9/23 11:00 PM, Andres Freund wrote: > Hi, > > On 2023-05-09 13:38:24 -0700, Jeff Davis wrote: >> On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote: >>> I don't think the approach of not having any sort of "registry" of >>> whether >>> anybody is waiting for the replay position to be updated is >>> feasible. Iterating over all walsenders slots is just too expensive - >> >> Would it work to use a shared counter for the waiters (or, two >> counters, one for physical and one for logical), and just early exit if >> the count is zero? > > That doesn't really fix the problem - once you have a single walsender > connected, performance is bad again. > Just to clarify, do you mean that if there is only one remaining active walsender that, say, has been located at slot n, then we’d still have to loop from 0 to n in WalSndWakeup()? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: walsender performance regression due to logical decoding on standby changes
From: "Drouvot, Bertrand"
Hi, On 5/10/23 8:36 AM, Bharath Rupireddy wrote: > On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de> wrote: >> >> Unfortunately I have found the following commit to have caused a performance >> regression: >> >> commit e101dfac3a53c20bfbf1ca85d30a368c2954facf >> >> The problem is that, on a standby, after the change - as needed to for the >> approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for >> every record, instead of only happening when the timeline is changed (or WAL >> is flushed or ...). >> >> WalSndWakeup() iterates over all walsender slots, regardless of whether in >> use. For each of the walsender slots it acquires a spinlock. >> >> When replaying a lot of small-ish WAL records I found the startup process to >> spend the majority of the time in WalSndWakeup(). I've not measured it very >> precisely yet, but the overhead is significant (~35% slowdown), even with the >> default max_wal_senders. If that's increased substantially, it obviously gets >> worse. > Thanks Andres for the call out! I do agree that this is a concern. >> The only saving grace is that this is not an issue on the primary. > > Yeah. +1 > >> My current guess is that mis-using a condition variable is the best bet. I >> think it should work to use ConditionVariablePrepareToSleep() before a >> WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use >> ConditionVariableSleep(). The latch set from ConditionVariableBroadcast() >> would still cause the necessary wakeup. Yeah, I think that "mis-using" a condition variable is a valid option. Unless I'm missing something, I don't think there is anything wrong with using a CV that way (aka not using ConditionVariableTimedSleep() or ConditionVariableSleep() in this particular case). > > How about something like the attached? Recovery and subscription tests > don't complain with the patch. Thanks Bharath for looking at it! 
I launched a full Cirrus CI test with it, but it failed on one environment (did not look at the details, just sharing this here): https://cirrus-ci.com/task/6570140767092736

Also I have a few comments:

@@ -1958,7 +1959,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
	 * ------
	 */
	if (AllowCascadeReplication())
-		WalSndWakeup(switchedTLI, true);
+		ConditionVariableBroadcast(&WalSndCtl->cv);

I think the comment above this change needs to be updated.

@@ -3368,9 +3370,13 @@ WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
	WaitEvent	event;

	ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
+
+	ConditionVariablePrepareToSleep(&WalSndCtl->cv);
	if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
		(event.events & WL_POSTMASTER_DEATH))
		proc_exit(1);
+
+	ConditionVariableCancelSleep();

Maybe worth updating the comment above WalSndWait() too? (To explain why CV handling is part of it.)

@@ -108,6 +109,8 @@ typedef struct
	 */
	bool		sync_standbys_defined;

+	ConditionVariable cv;

Worth giving it a more meaningful name? (To give a clue as to when it is used.)

I think we also need to update the comment above WalSndWakeup():

"
 * For cascading replication we need to wake up physical walsenders separately
 * from logical walsenders (see the comment before calling WalSndWakeup() in
 * ApplyWalRecord() for more details).
"

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
RE: walsender performance regression due to logical decoding on standby changes
From: "Zhijie Hou (Fujitsu)"
On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de> > wrote: > > > > Unfortunately I have found the following commit to have caused a > performance > > regression: > > > > commit e101dfac3a53c20bfbf1ca85d30a368c2954facf > > > > The problem is that, on a standby, after the change - as needed to for the > > approach to work - the call to WalSndWakeup() in ApplyWalRecord() > happens for > > every record, instead of only happening when the timeline is changed (or > WAL > > is flushed or ...). > > > > WalSndWakeup() iterates over all walsender slots, regardless of whether in > > use. For each of the walsender slots it acquires a spinlock. > > > > When replaying a lot of small-ish WAL records I found the startup process to > > spend the majority of the time in WalSndWakeup(). I've not measured it very > > precisely yet, but the overhead is significant (~35% slowdown), even with the > > default max_wal_senders. If that's increased substantially, it obviously gets > > worse. > > I played it with a simple primary -> standby1 -> standby2 setup. I ran > a pgbench script [1] on primary and counted the number of times > WalSndWakeup() gets called from ApplyWalRecord() and the number of > times spinlock is acquired/released in WalSndWakeup(). 
It's a whopping > 21 million times spinlock is acquired/released on the standby 1 and > standby 2 for just a < 5min of pgbench run on the primary: > > standby 1: > 2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO WalSndWakeup() in > ApplyWalRecord() was called 2176352 times > 2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO spinlock > acquisition/release count in WalSndWakeup() is 21763530 > > standby 2: > 2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO WalSndWakeup() in > ApplyWalRecord() was called 2176352 times > 2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO spinlock > acquisition/release count in WalSndWakeup() is 21763530 > > In this case, there is no timeline switch or no logical decoding on > the standby or such, but WalSndWakeup() gets called because the > standby can't make out if the slot is for logical or physical > replication unless spinlock is acquired. Before e101dfac3a, > WalSndWakeup() was getting called only when there was a timeline > switch. > > > The only saving grace is that this is not an issue on the primary. > > Yeah. > > > I don't think the approach of not having any sort of "registry" of whether > > anybody is waiting for the replay position to be updated is > > feasible. Iterating over all walsenders slots is just too expensive - > > WalSndWakeup() shows up even if I remove the spinlock (which we likely > could, > > especially when just checking if the the walsender is connected). > > Right. > > > My current guess is that mis-using a condition variable is the best bet. I > > think it should work to use ConditionVariablePrepareToSleep() before a > > WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use > > ConditionVariableSleep(). The latch set from ConditionVariableBroadcast() > > would still cause the necessary wakeup. > > How about something like the attached? Recovery and subscription tests > don't complain with the patch. Thanks for the patch. 
I noticed one place where the logic is different from before, and just to confirm:

	if (AllowCascadeReplication())
-		WalSndWakeup(switchedTLI, true);
+		ConditionVariableBroadcast(&WalSndCtl->cv);

After the change, we wake up physical walsenders regardless of the switchedTLI flag. Is this intentional? If so, I think it would be better to update the comments above this.

Best Regards,
Hou zj
Re: walsender performance regression due to logical decoding on standby changes
From: Amit Kapila
On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > My current guess is that mis-using a condition variable is the best bet. I > > > think it should work to use ConditionVariablePrepareToSleep() before a > > > WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use > > > ConditionVariableSleep(). The latch set from ConditionVariableBroadcast() > > > would still cause the necessary wakeup. > > > > How about something like the attached? Recovery and subscription tests > > don't complain with the patch. > > Thanks for the patch. I noticed one place where the logic is different from before and > just to confirm: > > if (AllowCascadeReplication()) > - WalSndWakeup(switchedTLI, true); > + ConditionVariableBroadcast(&WalSndCtl->cv); > > After the change, we wakeup physical walsender regardless of switchedTLI flag. > Is this intentional ? if so, I think It would be better to update the comments above this. > This raises the question of whether we need this condition variable logic for physical walsenders? -- With Regards, Amit Kapila.
Re: walsender performance regression due to logical decoding on standby changes
From: Masahiko Sawada
On Wed, May 10, 2023 at 7:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > I noticed one place where the logic is different from before and
> > just to confirm:
> >
> > if (AllowCascadeReplication())
> > -    WalSndWakeup(switchedTLI, true);
> > +    ConditionVariableBroadcast(&WalSndCtl->cv);
> >
> > After the change, we wakeup physical walsender regardless of switchedTLI flag.
> > Is this intentional ? if so, I think It would be better to update the comments above this.
>
> This raises the question of whether we need this condition variable
> logic for physical walsenders?

It sounds like a good idea. We can have two condition variables, one for logical and one for physical walsenders, and selectively wake up the walsenders sleeping on each condition variable. It should work, though it seems like a bit of a hack.

Regards,

[1] https://www.postgresql.org/message-id/2d314c22b9e03415aa1c7d8fd1f698dae60effa7.camel%40j-davis.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
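[Editor's illustration] The two-condition-variable shape suggested here can be sketched with a toy model (invented names and structures; the actual patch in this thread may differ in detail). The caller tells the wakeup routine which groups to wake, and each broadcast reaches only the waiters registered on that condition variable.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for PostgreSQL's ConditionVariable. */
typedef struct ToyCV
{
    int n_waiting;   /* waiters registered via the prepare-to-sleep step */
    int n_woken;     /* latches a broadcast would set */
} ToyCV;

ToyCV physical_cv;   /* physical walsenders: wait for WAL flush */
ToyCV logical_cv;    /* logical walsenders: wait for WAL apply */

void toy_cv_broadcast(ToyCV *cv)
{
    cv->n_woken += cv->n_waiting;   /* wake exactly the registered waiters */
}

/* Selective wakeup: flush -> (true, false); apply -> (false, true);
 * timeline switch -> (true, true). No scan of the walsender slot array. */
void wal_snd_wakeup(bool physical, bool logical)
{
    if (physical)
        toy_cv_broadcast(&physical_cv);
    if (logical)
        toy_cv_broadcast(&logical_cv);
}
```

On a standby, flushing WAL would then wake only the physical group, applying it only the logical group, and a timeline switch both.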
RE: walsender performance regression due to logical decoding on standby changes
From: "Zhijie Hou (Fujitsu)"
On Wednesday, May 10, 2023 3:03 AM Andres Freund <andres@anarazel.de> wrote: > Hi, > > Unfortunately I have found the following commit to have caused a > performance > regression: > > commit e101dfac3a53c20bfbf1ca85d30a368c2954facf > Author: Andres Freund <andres@anarazel.de> > Date: 2023-04-08 00:24:24 -0700 > > For cascading replication, wake physical and logical walsenders separately > > Physical walsenders can't send data until it's been flushed; logical > walsenders can't decode and send data until it's been applied. On the > standby, the WAL is flushed first, which will only wake up physical > walsenders; and then applied, which will only wake up logical > walsenders. > > Previously, all walsenders were awakened when the WAL was flushed. That > was fine for logical walsenders on the primary; but on the standby the > flushed WAL would have been not applied yet, so logical walsenders were > awakened too early. > > Per idea from Jeff Davis and Amit Kapila. > > Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> > Reviewed-By: Jeff Davis <pgsql@j-davis.com> > Reviewed-By: Robert Haas <robertmhaas@gmail.com> > Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> > Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> > Discussion: > https://postgr.es/m/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkb > W7Hqh6vQ@mail.gmail.com > > The problem is that, on a standby, after the change - as needed to for the > approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens > for > every record, instead of only happening when the timeline is changed (or WAL > is flushed or ...). > > WalSndWakeup() iterates over all walsender slots, regardless of whether in > use. For each of the walsender slots it acquires a spinlock. > > When replaying a lot of small-ish WAL records I found the startup process to > spend the majority of the time in WalSndWakeup(). 
> I've not measured it very precisely yet, but the overhead is significant
> (~35% slowdown), even with the default max_wal_senders. If that's increased
> substantially, it obviously gets worse.

I did some simple tests to see the performance impact on streaming replication; just sharing it here for reference.

1) Sync primary-standby setup: load data on the primary and count the time spent on replication. The degradation becomes more obvious as the value of max_wal_senders increases.

max_wal_senders    before(ms)    after(ms)     degradation
100                13394.4013    14141.2615    5.58%
200                13280.8507    14597.1173    9.91%
300                13535.0232    16735.7379    23.65%

2) Similar to 1), but count the time the standby startup process spent replaying WAL (via gprof).

10 senders
==========
before
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 4.12      0.45      0.11        1     0.11     2.46  PerformWalRecovery

after
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
17.99      0.59      0.59  4027383     0.00     0.00  WalSndWakeup
 8.23      0.86      0.27        1     0.27     3.11  PerformWalRecovery

100 senders
===========
before
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 5.56      0.36      0.18        1     0.18     2.91  PerformWalRecovery

after
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
64.65      4.39      4.39  4027383     0.00     0.00  WalSndWakeup
 2.95      4.59      0.20        1     0.20     6.62  PerformWalRecovery

Will test after applying the latest patch in this thread later.

Best Regards,
Hou zj
Re: walsender performance regression due to logical decoding on standby changes
From: Bharath Rupireddy
On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote:
>
> >> My current guess is that mis-using a condition variable is the best bet. I
> >> think it should work to use ConditionVariablePrepareToSleep() before a
> >> WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
> >> ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
> >> would still cause the necessary wakeup.
> >
> > How about something like the attached? Recovery and subscription tests
> > don't complain with the patch.
>
> I launched a full Cirrus CI test with it but it failed on one environment (did not look in details,
> just sharing this here): https://cirrus-ci.com/task/6570140767092736

Yeah, v1 had ConditionVariableInit() such that the CV was getting initialized for every backend, as opposed to just once after the WAL sender shmem was created.

> Also I have a few comments:

Indeed, v1 was a WIP patch. Please have a look at the attached v2 patch, which has comments and passing CI runs on all platforms - https://github.com/BRupireddy/postgres/tree/optimize_walsender_wakeup_logic_v2.

On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> if (AllowCascadeReplication())
> -    WalSndWakeup(switchedTLI, true);
> +    ConditionVariableBroadcast(&WalSndCtl->cv);
>
> After the change, we wakeup physical walsender regardless of switchedTLI flag.
> Is this intentional ? if so, I think It would be better to update the comments above this.

That's not the case with the attached v2 patch. Please have a look.

On Thu, May 11, 2023 at 10:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> We can have two condition variables for
> logical and physical walsenders, and selectively wake up walsenders
> sleeping on the condition variables. It should work, it seems like
> much of a hack, though.

As Andres rightly put it, this is 'mis-using' the CV infrastructure. It is simple, it works, and it keeps WalSndWakeup() simple while solving the performance regression.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments
RE: walsender performance regression due to logical decoding on standby changes
From: "Zhijie Hou (Fujitsu)"
On Friday, May 12, 2023 7:58 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Indeed, v1 was a WIP patch. Please have a look at the attached v2 patch, which
> has comments and passing CI runs on all platforms -
> https://github.com/BRupireddy/postgres/tree/optimize_walsender_wakeup_logic_v2.
>
> That's not the case with the attached v2 patch. Please have a look.

Thanks for updating the patch. I did some simple primary->standby replication tests with the patch and can see that the replication degradation doesn't happen after applying it [1].

One nitpick in a comment:

+ * walsenders. It makes WalSndWakeup() callers life easy.

callers life easy => callers' life easy.

[1]
max_wal_senders = 100
before regression(ms)    after regression(ms)    v2 patch(ms)
13394.4013               14141.2615              13455.2543
Compared with before regression:  5.58%          0.45%

max_wal_senders = 200
before regression(ms)    after regression(ms)    v2 patch(ms)
13280.8507               14597.1173              13632.0606
Compared with before regression:  9.91%          1.64%

max_wal_senders = 300
before regression(ms)    after regression(ms)    v2 patch(ms)
13535.0232               16735.7379              13705.7135
Compared with before regression:  23.65%         1.26%

Best Regards,
Hou zj
Re: walsender performance regression due to logical decoding on standby changes
From: Thomas Munro
On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> Andres, rightly put it - 'mis-using' CV infrastructure. It is simple,
> works, and makes the WalSndWakeup() easy solving the performance
> regression.

Yeah, this seems OK, and better than the complicated alternatives. If one day we want to implement CVs some other way, so that this I-know-that-CVs-are-really-made-out-of-latches abstraction leak becomes a problem, and we still need this, well then we can make a separate latch-wait-list thing.
Re: walsender performance regression due to logical decoding on standby changes
From: "Drouvot, Bertrand"
Hi,

On 5/15/23 4:19 AM, Zhijie Hou (Fujitsu) wrote:
> On Friday, May 12, 2023 7:58 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>>
>> That's not the case with the attached v2 patch. Please have a look.

Thanks for v2! It does look good to me, and I like the fact that WalSndWakeup() no longer needs to loop over all the walsender slots (for both the physical and logical cases).

> Thanks for updating the patch. I did some simple primary->standby replication test for the
> patch and can see the degradation doesn't happen in the replication after applying it [1].

Thanks for the performance regression measurement!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: walsender performance regression due to logical decoding on standby changes
From: Masahiko Sawada
On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Yeah, this seems OK, and better than the complicated alternatives. If
> one day we want to implement CVs some other way so that this
> I-know-that-CVs-are-really-made-out-of-latches abstraction leak
> becomes a problem, and we still need this, well then we can make a
> separate latch-wait-list thing.

+1

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: walsender performance regression due to logical decoding on standby changes
From:
Bharath Rupireddy
Date:
On Mon, May 15, 2023 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
>> <bharath.rupireddyforpostgres@gmail.com> wrote:
>>> Andres rightly put it - 'mis-using' CV infrastructure. It is simple,
>>> works, and makes WalSndWakeup() easy, solving the performance
>>> regression.
>>
>> Yeah, this seems OK, and better than the complicated alternatives. If
>> one day we want to implement CVs some other way so that this
>> I-know-that-CVs-are-really-made-out-of-latches abstraction leak
>> becomes a problem, and we still need this, well then we can make a
>> separate latch-wait-list thing.
>
> +1

Thanks for acknowledging the approach.

On Mon, May 15, 2023 at 2:11 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote:
> Thanks for v2! It does look good to me and I like the fact that
> WalSndWakeup() does not need to loop over all the walsender slots
> anymore (for both the physical and logical cases).

Indeed, it doesn't have to.

On Mon, May 15, 2023 at 7:50 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> Thanks for updating the patch. I did some simple primary->standby replication test for the
> patch and can see the degradation doesn't happen in the replication after applying it[1].
>
> One nitpick in the comment:
>
> + * walsenders. It makes WalSndWakeup() callers life easy.
>
> callers life easy => callers' life easy.

Changed.
> [1]
> max_wal_senders = 100
>                                  before regression (ms)  after regression (ms)  v2 patch (ms)
>                                  13394.4013              14141.2615             13455.2543
> Compared with before regression                          5.58%                  0.45%
>
> max_wal_senders = 200
>                                  before regression (ms)  after regression (ms)  v2 patch (ms)
>                                  13280.8507              14597.1173             13632.0606
> Compared with before regression                          9.91%                  1.64%
>
> max_wal_senders = 300
>                                  before regression (ms)  after regression (ms)  v2 patch (ms)
>                                  13535.0232              16735.7379             13705.7135
> Compared with before regression                          23.65%                 1.26%

Yes, the numbers with the v2 patch look close to where we were before the regression. Thanks for confirming. Just wondering, where is this extra 0.45%/1.64%/1.26% coming from?

Please find the attached v3 with the review comment addressed.

Do we need to add an open item for this issue in https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items? If yes, can anyone in this loop add one?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments
Re: walsender performance regression due to logical decoding on standby changes
From:
"Drouvot, Bertrand"
Date:
Hi,

On 5/15/23 4:39 PM, Bharath Rupireddy wrote:
> On Mon, May 15, 2023 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Do we need to add an open item for this issue in
> https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items? If yes, can
> anyone in this loop add one?

I do think we need one for this issue, so I've just added it.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: walsender performance regression due to logical decoding on standby changes
From:
Andres Freund
Date:
Hi,

On 2023-05-10 08:39:08 +0200, Drouvot, Bertrand wrote:
> On 5/9/23 11:00 PM, Andres Freund wrote:
>> On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:
>>> On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
>>>> I don't think the approach of not having any sort of "registry" of
>>>> whether anybody is waiting for the replay position to be updated is
>>>> feasible. Iterating over all walsender slots is just too expensive -
>>>
>>> Would it work to use a shared counter for the waiters (or, two
>>> counters, one for physical and one for logical), and just early exit if
>>> the count is zero?
>>
>> That doesn't really fix the problem - once you have a single walsender
>> connected, performance is bad again.
>
> Just to clarify, do you mean that if there is only one remaining active walsender that, say,
> has been located at slot n, then we'd still have to loop from 0 to n in WalSndWakeup()?

I understood Jeff's proposal to just have an early exit if there are no walsenders connected at all. But yes, even if we stopped iterating after finding the number of slots we needed to, having to iterate over empty slots would be an issue.

But TBH, even if we only did work for connected walsenders, I think this would still be a performance issue. Acquiring O(#connected-walsenders) spinlocks for every record is just too expensive.

Greetings,

Andres Freund
Re: walsender performance regression due to logical decoding on standby changes
From:
Andres Freund
Date:
Hi,

On 2023-05-11 09:42:39 +0000, Zhijie Hou (Fujitsu) wrote:
> I did some simple tests for this to see the performance impact on
> the streaming replication, just share it here for reference.
>
> 1) sync primary-standby setup, load data on primary and count the time spent on
> replication. the degradation will be more obvious as the value of max_wal_senders
> increases.

FWIW, using syncrep likely under-estimates the overhead substantially, because that includes a lot of overhead on the WAL-generating side. I saw well over 20% overhead for the default max_wal_senders=10. I just created a standby, shut it down, then ran a deterministically-sized workload on the primary, started the standby, and measured how long it took to catch up. I just used the log messages to measure the time.

> 2) Similar as 1) but count the time that the standby startup process spent on
> replaying WAL (via gprof).

I don't think that's the case here, but IME gprof's overhead is so high that it can move bottlenecks quite drastically. The problem is that it adds code to every function entry/exit - for simple functions, that overhead is much higher than the "original" cost of the function. gprof-style instrumentation is good for things like code coverage, but for performance evaluation it's normally better to use a sampling profiler like perf. That also causes slowdowns, but largely only in places that already take up substantial execution time.

Greetings,

Andres Freund
On Wed, 2023-05-17 at 12:34 -0700, Andres Freund wrote:
> I understood Jeff's proposal to just have an early exit if there are
> no walsenders connected at all.

My suggestion was to exit early unless there is at least one *waiting* walsender of the appropriate type. In other words, increment the counter on entry to WalSndWait() and decrement it on exit. I didn't look in detail yet, so perhaps it's not easy/cheap to do that safely.

Regards,
Jeff Davis
Re: walsender performance regression due to logical decoding on standby changes
From:
Andres Freund
Date:
Hi,

Thanks for working on the patch!

On 2023-05-15 20:09:00 +0530, Bharath Rupireddy wrote:
>> [1]
>> max_wal_senders = 100
>>                                  before regression (ms)  after regression (ms)  v2 patch (ms)
>>                                  13394.4013              14141.2615             13455.2543
>> Compared with before regression                          5.58%                  0.45%
>>
>> max_wal_senders = 200
>>                                  before regression (ms)  after regression (ms)  v2 patch (ms)
>>                                  13280.8507              14597.1173             13632.0606
>> Compared with before regression                          9.91%                  1.64%
>>
>> max_wal_senders = 300
>>                                  before regression (ms)  after regression (ms)  v2 patch (ms)
>>                                  13535.0232              16735.7379             13705.7135
>> Compared with before regression                          23.65%                 1.26%
>
> Yes, the numbers with v2 patch look close to where we were before.
> Thanks for confirming. Just wondering, where does this extra
> 0.45%/1.64%/1.26% coming from?

We still do more work for each WAL record than before, so I'd expect something small. I'd say right now the main overhead with the patch comes from the spinlock acquisitions in ConditionVariableBroadcast(), which happen even when nobody is waiting.

I'll try to come up with a benchmark without the issues I pointed out in https://postgr.es/m/20230517194331.ficfy5brpfq5lrmz%40awork3.anarazel.de

> + ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
> + ConditionVariableInit(&WalSndCtl->logicalWALSndCV);

It's not obvious to me that it's worth having two CVs, because it's more expensive to find no waiters in two CVs than to find no waiters in one CV.

> + /*
> +  * We use a condition variable (CV) to efficiently wake up walsenders in
> +  * WalSndWakeup().
> +  *
> +  * Every walsender prepares to sleep on a shared memory CV. Note that it
> +  * just prepares to sleep on the CV (i.e., adds itself to the CV's
> +  * waitlist), but does not actually wait on the CV (IOW, it never calls
> +  * ConditionVariableSleep()). It still uses WaitEventSetWait() for
> +  * waiting, because the CV infrastructure doesn't handle FeBe socket
> +  * events currently. The processes (startup process, walreceiver etc.)
> +  * wanting to wake up walsenders use ConditionVariableBroadcast(), which
> +  * in turn calls SetLatch(), helping walsenders come out of
> +  * WaitEventSetWait().
> +  *
> +  * This approach is simple and efficient because one doesn't have to loop
> +  * through all the walsender slots, with a spinlock acquisition and
> +  * release for every iteration, just to wake up only the waiting
> +  * walsenders. It makes WalSndWakeup() callers' life easy.
> +  *
> +  * XXX: When available, WaitEventSetWait() can be replaced with its CV's
> +  * counterpart.

I don't really understand that XXX - the potential bright future would be to add support for CVs into wait event sets, not to replace WES with a CV?

Greetings,

Andres Freund
Re: walsender performance regression due to logical decoding on standby changes
From:
Andres Freund
Date:
Hi,

On 2023-05-17 12:53:15 -0700, Andres Freund wrote:
> I'll try to come up with a benchmark without the issues I pointed out in
> https://postgr.es/m/20230517194331.ficfy5brpfq5lrmz%40awork3.anarazel.de

Here we go:

setup:
  create primary
  SELECT pg_create_physical_replication_slot('reserve', true);
  create standby using pg_basebackup

create WAL:
  psql -c 'CREATE TABLE testtable_logged(other_data int default 1);' && \
  c=16; PGOPTIONS='-c synchronous_commit=off' /path-to-pgbench --random-seed=0 -n -c$c -j$c -t1000 -P1 \
    -f <( echo "INSERT INTO testtable_logged SELECT generate_series(1, 1000)" ) && \
  psql -c "SELECT pg_create_restore_point('end');"

benchmark:
  rm -rf /tmp/test && \
  cp -ar /srv/dev/pgdev-dev-standby /tmp/test && \
  cp -ar /srv/dev/pgdev-dev/pg_wal/* /tmp/test/pg_wal/ && \
  sync && \
  /usr/bin/time -f '%es' /path-to-postgres -D /tmp/test \
    -c recovery_target_action=shutdown -c recovery_target_name=end \
    -c shared_buffers=1GB -c fsync=off

That way I can measure how long it takes to replay exactly the same WAL, and also take profiles of exactly the same work, without influencing time results. I copy the WAL files into the standby's pg_wal to ensure that walreceiver (standby) / walsender (primary) performance doesn't make the result variability higher.

                       max_wal_senders=10   max_wal_senders=100
e101dfac3a5 reverted   7.01s                7.02s
093e5c57d50 / HEAD     8.25s                19.91s
bharat-v3              7.14s                7.13s

So indeed, bharat-v3 largely fixes the issue. The regression of v3 compared to e101dfac3a5 reverted seems pretty constant at ~0.982x, independent of the concrete max_wal_senders value. Which makes sense, the work is constant.
To make it more extreme, I also tested a workload that is basically free to replay:

  c=16; /srv/dev/build/m-opt/src/bin/pgbench/pgbench --random-seed=0 -n -c$c -j$c -t5000 -P1 \
    -f <( echo "SELECT pg_logical_emit_message(false, 'c', 'a') FROM generate_series(1, 1000)" ) && \
  psql -c "SELECT pg_create_restore_point('end');"

                       max_wal_senders=10   max_wal_senders=100
e101dfac3a5 reverted   1.70s                1.70s
093e5c57d50 / HEAD     3.00s                14.56s
bharat-v3              1.88s                1.88s

In this extreme workload we still regress to ~0.904x. I'm not sure how much it's worth worrying about that - this is a quite unrealistic testcase.

FWIW, if I just make WalSndWakeup() do nothing, I still see a very small, but reproducible, overhead: 1.72s - that's just the cost of the additional external function call. If I add a no-waiters fastpath using proclist_is_empty() to ConditionVariableBroadcast(), I get 1.77s. So the majority of the remaining slowdown indeed comes from the spinlock acquisition in ConditionVariableBroadcast().

Greetings,

Andres Freund
Re: walsender performance regression due to logical decoding on standby changes
From:
Bharath Rupireddy
Date:
On Thu, May 18, 2023 at 1:23 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-05-15 20:09:00 +0530, Bharath Rupireddy wrote:
>>> [1]
>>> max_wal_senders = 100
>>>                                  before regression (ms)  after regression (ms)  v2 patch (ms)
>>>                                  13394.4013              14141.2615             13455.2543
>>> Compared with before regression                          5.58%                  0.45%
>>>
>>> max_wal_senders = 200
>>>                                  before regression (ms)  after regression (ms)  v2 patch (ms)
>>>                                  13280.8507              14597.1173             13632.0606
>>> Compared with before regression                          9.91%                  1.64%
>>>
>>> max_wal_senders = 300
>>>                                  before regression (ms)  after regression (ms)  v2 patch (ms)
>>>                                  13535.0232              16735.7379             13705.7135
>>> Compared with before regression                          23.65%                 1.26%
>>
>> Yes, the numbers with v2 patch look close to where we were before.
>> Thanks for confirming. Just wondering, where does this extra
>> 0.45%/1.64%/1.26% coming from?
>
> We still do more work for each WAL record than before, so I'd expect something
> small. I'd say right now the main overhead with the patch comes from the
> spinlock acquisitions in ConditionVariableBroadcast(), which happen even when
> nobody is waiting.
>
> + ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
> + ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
>
> It's not obvious to me that it's worth having two CVs, because it's more
> expensive to find no waiters in two CVs than to find no waiters in one CV.

I disagree. In the tight per-WAL record recovery loop, WalSndWakeup wakes up logical walsenders for every WAL record, but it wakes up physical walsenders only if the applied WAL record causes a TLI switch. Therefore, the extra per-WAL-record cost of spinlock acquire-release applies only to logical walsenders. On the other hand, if we were to use a single CV, we would be unnecessarily waking up (if at all they are sleeping) physical walsenders for every WAL record - which is costly IMO.
I still think separate CVs are good for selective wakeups, given that there can't be many TLI-switch WAL records.

>> + *
>> + * XXX: When available, WaitEventSetWait() can be replaced with its CV's
>> + * counterpart.
>
> I don't really understand that XXX - the potential bright future would be to
> add support for CVs into wait event sets, not to replace WES with a CV?

Yes, I meant that, and modified that part to:

 * XXX: A desirable future improvement would be to add support for CVs into
 * WaitEventSetWait().

> FWIW, if I just make WalSndWakeup() do nothing, I still see a very small, but
> reproducible, overhead: 1.72s - that's just the cost of the additional
> external function call.

Hm, this is unavoidable.

> If I add a no-waiters fastpath using proclist_is_empty() to
> ConditionVariableBroadcast(), I get 1.77s. So the majority of the remaining
> slowdown indeed comes from the spinlock acquisition in
> ConditionVariableBroadcast().

Acquiring a spinlock for every replayed WAL record seems not great. In general, this problem exists for the CV infrastructure as a whole whenever ConditionVariableBroadcast()/CVB() is called in a loop/hot path. I think this can be shown easily by doing something like select pg_replication_origin_session_setup()/pg_wal_replay_resume()/pg_create_physical_replication_slot() from generate_series(1, 1000000...); - all of these functions end up in CVB().

Do you think adding a fastpath exit to CVB() when there are no waiters is worth implementing? Something like an atomic state variable in each CV, similar to LWLock's state variable: set it when adding waiters, reset it when removing waiters, and have CVB() read the state atomically for the fastpath (of course, we need memory barriers here). This might complicate things, as each CV structure gets an extra state variable (4 bytes), memory barriers, and atomic reads and writes.
Or, given the current uses of CVB() - with no caller except recovery invoking it in a hot path after the patch - maybe we can add an atomic waiter count to WalSndCtl: increment it atomically in WalSndWait() before the wait, decrement it after the wait, and add a fastpath to WalSndWakeup() that reads it atomically to avoid the CVB() calls. IIUC, this is something Jeff proposed upthread. Thoughts?

Please find the attached v4 patch addressing the review comment (not the fastpath one).

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments
Re: walsender performance regression due to logical decoding on standby changes
From:
Kyotaro Horiguchi
Date:
At Thu, 18 May 2023 20:11:11 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
>>> + ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
>>> + ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
>>
>> It's not obvious to me that it's worth having two CVs, because it's more
>> expensive to find no waiters in two CVs than to find no waiters in one CV.
>
> I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
> wakes up logical walsenders for every WAL record, but it wakes up
> physical walsenders only if the applied WAL record causes a TLI
> switch. Therefore, the extra cost of spinlock acquire-release per
> WAL record applies only to logical walsenders. On the other hand, if
> we were to use a single CV, we would be unnecessarily waking up (if at
> all they are sleeping) physical walsenders for every WAL record -
> which is costly IMO.

As I was reading this, I started thinking that one reason for the regression could be the excessive frequency of wakeups during logical replication. In physical replication, we make sure to avoid excessive wakeups when the stream is tightly packed. I'm just wondering why logical replication doesn't (or can't) do the same thing, IMHO.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
RE: walsender performance regression due to logical decoding on standby changes
From:
"Zhijie Hou (Fujitsu)"
Date:
On Friday, May 19, 2023 11:08 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Thu, 18 May 2023 20:11:11 +0530, Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote in
>>>> + ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
>>>> + ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
>>>
>>> It's not obvious to me that it's worth having two CVs, because it's more
>>> expensive to find no waiters in two CVs than to find no waiters in one CV.
>>
>> I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
>> wakes up logical walsenders for every WAL record, but it wakes up
>> physical walsenders only if the applied WAL record causes a TLI
>> switch. Therefore, the extra cost of spinlock acquire-release per
>> WAL record applies only to logical walsenders. On the other hand, if
>> we were to use a single CV, we would be unnecessarily waking up (if at
>> all they are sleeping) physical walsenders for every WAL record -
>> which is costly IMO.
>
> As I was reading this, I started thinking that one reason for the
> regression could be the excessive frequency of wakeups during logical
> replication. In physical replication, we make sure to avoid excessive
> wakeups when the stream is tightly packed. I'm just wondering why
> logical replication doesn't (or can't) do the same thing, IMHO.

I thought (from e101dfa's commit message) that physical walsenders can't send data until it has been flushed, so they are only woken up at the time of a flush or a TLI switch. A logical walsender can start to decode changes after a change is applied, so, to avoid the case where the walsender is asleep while there's work to be done, it is woken up when applying each record.

Or maybe you mean to wake up (ConditionVariableBroadcast) the walsenders only after applying some number of WAL records, like [1]? But it seems that may delay the wakeup of a walsender a bit (the walsender may be asleep before the threshold is reached).
[1]
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1833,6 +1833,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 {
 	ErrorContextCallback errcallback;
 	bool		switchedTLI = false;
+	static int	nreplay = 0;

 	/* Setup error traceback support for ereport() */
 	errcallback.callback = rm_redo_error_callback;
@@ -1957,8 +1958,12 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	 * be created otherwise)
 	 * ------
 	 */
-	if (AllowCascadeReplication())
+	if (AllowCascadeReplication() &&
+		nreplay++ == 100)
+	{
 		WalSndWakeup(switchedTLI, true);
+		nreplay = 0;
+	}

Best Regards,
Hou zj
Re: walsender performance regression due to logical decoding on standby changes
From:
Andres Freund
Date:
Hi,

On 2023-05-19 12:07:56 +0900, Kyotaro Horiguchi wrote:
> At Thu, 18 May 2023 20:11:11 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
>>>> + ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
>>>> + ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
>>>
>>> It's not obvious to me that it's worth having two CVs, because it's more
>>> expensive to find no waiters in two CVs than to find no waiters in one CV.
>>
>> I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
>> wakes up logical walsenders for every WAL record, but it wakes up
>> physical walsenders only if the applied WAL record causes a TLI
>> switch. Therefore, the extra cost of spinlock acquire-release per
>> WAL record applies only to logical walsenders. On the other hand, if
>> we were to use a single CV, we would be unnecessarily waking up (if at
>> all they are sleeping) physical walsenders for every WAL record -
>> which is costly IMO.
>
> As I was reading this, I started thinking that one reason for the
> regression could be the excessive frequency of wakeups during logical
> replication. In physical replication, we make sure to avoid excessive
> wakeups when the stream is tightly packed. I'm just wondering why
> logical replication doesn't (or can't) do the same thing, IMHO.

It's possible we could try to reduce the frequency by issuing wakeups only at specific points. The most obvious thing to do would be to wake only when waiting for more WAL or when crossing a page boundary, or such. Unfortunately that could easily lead to deadlocks, because the startup process might be blocked waiting for a lock held by a backend doing logical decoding - which can't progress until the startup process wakes the backend up.

So I don't think this is a promising avenue in the near term.

Greetings,

Andres Freund
Re: walsender performance regression due to logical decoding on standby changes
From:
Andres Freund
Date:
Hi,

On 2023-05-18 20:11:11 +0530, Bharath Rupireddy wrote:
> Please find the attached v4 patch addressing the review comment (not
> the fastpath one).

I pushed a mildly edited version of this. I didn't like the names of the CVs much, so I renamed them to wal_flush_cv/wal_replay_cv. I also did some minor comment polishing.

Greetings,

Andres Freund
RE: walsender performance regression due to logical decoding on standby changes
From:
"Zhijie Hou (Fujitsu)"
Date:
Hi,

On Monday, May 22, 2023 12:11 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-05-19 12:07:56 +0900, Kyotaro Horiguchi wrote:
>> At Thu, 18 May 2023 20:11:11 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
>>>>> + ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
>>>>> + ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
>>>>
>>>> It's not obvious to me that it's worth having two CVs, because it's more
>>>> expensive to find no waiters in two CVs than to find no waiters in one CV.
>>>
>>> I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
>>> wakes up logical walsenders for every WAL record, but it wakes up
>>> physical walsenders only if the applied WAL record causes a TLI
>>> switch. Therefore, the extra cost of spinlock acquire-release per
>>> WAL record applies only to logical walsenders. On the other hand, if
>>> we were to use a single CV, we would be unnecessarily waking up (if at
>>> all they are sleeping) physical walsenders for every WAL record -
>>> which is costly IMO.
>>
>> As I was reading this, I started thinking that one reason for the
>> regression could be the excessive frequency of wakeups during logical
>> replication. In physical replication, we make sure to avoid excessive
>> wakeups when the stream is tightly packed. I'm just wondering why
>> logical replication doesn't (or can't) do the same thing, IMHO.
>
> It's possible we could try to reduce the frequency by issuing wakeups only at
> specific points. The most obvious thing to do would be to wake only when
> waiting for more WAL or when crossing a page boundary, or such. Unfortunately
> that could easily lead to deadlocks, because the startup process might be
> blocked waiting for a lock, held by a backend doing logical decoding - which
> can't progress until the startup process wakes the backend up.
Just out of curiosity about the mentioned deadlock scenario - no comment on the committed patch.

About "a backend doing logical decoding", do you mean the case when a user starts a backend and invokes pg_logical_slot_get_changes() to do the logical decoding? If so, it seems the logical decoding in a backend won't be woken up by the startup process, because the backend won't be registered as a walsender and so won't be found in WalSndWakeup().

Or do you mean a deadlock between a real logical walsender and the startup process? (I might be missing something.) I think logical decoding doesn't lock the target user relation when decoding, because it can normally get the needed information from WAL. Besides, the walsender sometimes locks system tables (e.g. uses RelidByRelfilenumber() to get the relid), but it unlocks them after finishing the systable scan.

So, if possible, would you be able to share some details about the deadlock case you mentioned earlier? It would be helpful, as it can prevent similar problems in future development.

Best Regards,
Hou zj
Re: walsender performance regression due to logical decoding on standby changes
From:
Andres Freund
Date:
Hi,

On 2023-05-22 12:15:07 +0000, Zhijie Hou (Fujitsu) wrote:
> About "a backend doing logical decoding", do you mean the case when a user
> starts a backend and invokes pg_logical_slot_get_changes() to do the logical
> decoding? If so, it seems the logical decoding in a backend won't be woken up
> by the startup process, because the backend won't be registered as a walsender
> and so won't be found in WalSndWakeup().

I meant logical decoding happening inside a walsender instance.

> Or do you mean a deadlock between a real logical walsender and the startup
> process? (I might be missing something.) I think logical decoding doesn't lock
> the target user relation when decoding, because it can normally get the needed
> information from WAL.

It does lock catalog tables briefly. There's no guarantee that such locks are released immediately. I forgot the details, but IIRC there are some outfuncs (enum?) that intentionally delay releasing locks till transaction commit.

Greetings,

Andres Freund
RE: walsender performance regression due to logical decoding on standby changes
From:
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, May 23, 2023 1:53 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-05-22 12:15:07 +0000, Zhijie Hou (Fujitsu) wrote:
>> About "a backend doing logical decoding", do you mean the case when a user
>> starts a backend and invokes pg_logical_slot_get_changes() to do the logical
>> decoding? If so, it seems the logical decoding in a backend won't be woken up
>> by the startup process, because the backend won't be registered as a walsender
>> and so won't be found in WalSndWakeup().
>
> I meant logical decoding happening inside a walsender instance.
>
>> Or do you mean a deadlock between a real logical walsender and the startup
>> process? (I might be missing something.) I think logical decoding doesn't lock
>> the target user relation when decoding, because it can normally get the needed
>> information from WAL.
>
> It does lock catalog tables briefly. There's no guarantee that such locks are
> released immediately. I forgot the details, but IIRC there are some outfuncs
> (enum?) that intentionally delay releasing locks till transaction commit.

Thanks for the explanation! I understand that the startup process can take a lock on the catalog (when replaying a record) which may conflict with a lock in the walsender.

But in the walsender, I think we only start a transaction after entering ReorderBufferProcessTXN(), and the transaction started there is released soon after processing and outputting the decoded transaction's data (as the comment in ReorderBufferProcessTXN() says: "all locks acquired in here to be released, not reassigned to the parent and we do not want any database access have persistent effects."). Besides, during the processing and output of the decoded transaction, the walsender doesn't wait for a wakeup from the startup process (e.g. WalSndWaitForWal()); it only waits if the data is being sent to the subscriber. So it seems the lock conflict here won't cause a deadlock for now, although there may be a risk if we change this logic later.
Sorry if I missed something, and thanks again for your patience in explaining. Best Regards, Hou zj
Re: walsender performance regression due to logical decoding on standby changes
From:
Andres Freund
Date:
Hi, On 2023-05-24 05:53:51 +0000, Zhijie Hou (Fujitsu) wrote: > On Tuesday, May 23, 2023 1:53 AM Andres Freund <andres@anarazel.de> wrote: > > On 2023-05-22 12:15:07 +0000, Zhijie Hou (Fujitsu) wrote: > > > About "a backend doing logical decoding", do you mean the case when a > > user > > > start a backend and invoke pg_logical_slot_get_changes() to do the logical > > > decoding ? If so, it seems the logical decoding in a backend won't be waked > > up > > > by startup process because the backend won't be registered as a walsender > > so > > > the backend won't be found in WalSndWakeup(). > > > > I meant logical decoding happening inside a walsender instance. > > > > > > > Or do you mean the deadlock between the real logical walsender and startup > > > process ? (I might miss something) I think the logical decoding doesn't lock > > > the target user relation when decoding because it normally can get the > > needed > > > information from WAL. > > > > It does lock catalog tables briefly. There's no guarantee that such locks are > > released immediately. I forgot the details, but IIRC there's some outfuncs > > (enum?) that intentionally delay releasing locks till transaction commit. > > Thanks for the explanation ! > > I understand that the startup process can take lock on the catalog(when > replaying record) which may conflict with the lock in walsender. > > But in walsender, I think we only start transaction after entering > ReorderBufferProcessTXN(), and the transaction started here will be released > soon after processing and outputting the decoded transaction's data(as the > comment in ReorderBufferProcessTXN() says:" all locks acquired in here to be > released, not reassigned to the parent and we do not want any database access > have persistent effects."). It's possible that there's no immediate danger - but I wouldn't want to bet on it staying that way. Think about things like streaming out large transactions etc. 
There are also potential dangers outside of plain things like locks - there could be recovery conflicts on the snapshot or such (although those normally should be prevented via hot_standby_feedback or such).

Even if there's no hard deadlock issue - not waking up a walsender-in-logical-decoding that's currently waiting, because (replay_count % N) != 0, means the walsender might be delayed for quite a while: there might not be any further records from the primary in the near future.

I don't think any approach purely based on record counts has any chance of working well. You could combine it with something else, e.g. with always waking up in XLogPageRead() and WaitForWALToBecomeAvailable(). But then you end up with something relatively complicated again.

Greetings,

Andres Freund