walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
From: Tomas Vondra
Subject: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
Date:
Msg-id: 2840876b-4cfe-240f-0a7e-29ffd66711e7@enterprisedb.com
Responses: Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables) (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

while working on some logical replication stuff, I noticed that on PG16 I
often end up with a completely idle publisher (no user activity) that
nevertheless looks like this in top:

  ...  %CPU  COMMAND
  ...  17.9  postgres: walsender user test ::1(43064) START_REPLICATION
  ...  16.6  postgres: walsender user test ::1(43128) START_REPLICATION
  ...  15.0  postgres: walsender user test ::1(43202) START_REPLICATION
  ...   6.6  postgres: walsender user test ::1(43236) START_REPLICATION
  ...   5.3  postgres: walsender user test ::1(43086) START_REPLICATION
  ...   4.3  postgres: walsender user test ::1(43180) START_REPLICATION
  ...   3.7  postgres: walsender user test ::1(43052) START_REPLICATION
  ...   3.7  postgres: walsender user test ::1(43158) START_REPLICATION
  ...   3.0  postgres: walsender user test ::1(43108) START_REPLICATION
  ...   3.0  postgres: walsender user test ::1(43214) START_REPLICATION

That's an awful lot of CPU for a cluster doing essentially "nothing"
(there's no WAL to decode/replicate, etc.). It usually disappears after a
couple of seconds, but sometimes it's a rather persistent state.

The profile from the walsender processes looks like this:

  --99.94%--XLogSendLogical
            |
            |--99.23%--XLogReadRecord
            |          XLogReadAhead
            |          XLogDecodeNextRecord
            |          ReadPageInternal
            |          logical_read_xlog_page
            |          |
            |          |--97.80%--WalSndWaitForWal
            |          |          |
            |          |          |--68.48%--WalSndWait

It seems to me the issue is in WalSndWait, which was reworked to use
ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
waking each other in a busy loop, until the timing changes just enough to
break the cycle. I've been unable to reproduce this on PG15, and
bc971f4025c seems like the most significant change to WalSndWait, which is
why I suspect it's related to the issue.

Reproducing this is simple: create a publisher with multiple subscribers
(they can even go to the same subscriber instance) and empty publications.
If you then trigger a "noop", it's likely to cause this high CPU usage:

---------------------------------------------------------------------
# init two clusters
pg_ctl -D data-publisher init
pg_ctl -D data-subscriber init

# change the parameters to allow 10 subscriptions
echo 'wal_level = logical' >> data-publisher/postgresql.conf
echo 'port = 5433' >> data-subscriber/postgresql.conf
echo 'max_worker_processes = 20' >> data-subscriber/postgresql.conf
echo 'max_logical_replication_workers = 20' >> data-subscriber/postgresql.conf

# start both clusters
pg_ctl -D data-publisher -l pub.log start
pg_ctl -D data-subscriber -l sub.log start

# setup empty publication
createdb test
psql test -c "create publication p";

# setup 10 subscriptions
for i in `seq 1 10`; do
  createdb -p 5433 test$i
  psql -p 5433 test$i -c "create subscription s$i CONNECTION 'host=localhost port=5432 dbname=test' publication p";
done

# emit logical messages, which are almost a noop, 5s apart
for i in `seq 1 10`; do
  psql test -c "select pg_logical_emit_message(false, 'x', 'x')";
  sleep 5;
done;
---------------------------------------------------------------------

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
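PS: For anyone who doesn't have this code paged in, the wait path after
bc971f4025c looks roughly like this. This is a simplified sketch from
memory, not the actual source, so field names and details may be off:

#include "postgres.h"

#include "libpq/libpq.h"
#include "replication/walsender_private.h"
#include "storage/condition_variable.h"
#include "storage/ipc.h"
#include "storage/latch.h"

static void
WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
{
    WaitEvent   event;

    ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);

    /*
     * Register on a shared condition variable so that WalSndWakeup() can
     * wake us with a broadcast, but do the actual waiting in
     * WaitEventSetWait() so we also notice socket events and timeouts.
     *
     * bc971f4025c added two CVs to WalSndCtl (wal_flush_cv and
     * wal_replay_cv); which one a walsender registers on depends on whether
     * it's physical or logical - using one of them here for the sketch.
     */
    ConditionVariablePrepareToSleep(&WalSndCtl->wal_flush_cv);

    if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
        (event.events & WL_POSTMASTER_DEATH))
        proc_exit(1);

    /*
     * Deregister again. AFAICS, if we got signaled while still on the CV's
     * wait list, ConditionVariableCancelSleep() passes the signal on to the
     * next waiter so it isn't lost - which would be one way for a single
     * wakeup to keep bouncing between walsenders that have nothing to send.
     */
    ConditionVariableCancelSleep();
}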