Re: 035_standby_logical_decoding unbounded hang
От | Bertrand Drouvot |
---|---|
Тема | Re: 035_standby_logical_decoding unbounded hang |
Дата | |
Msg-id | ZczcxIOes4s6B7QA@ip-10-97-1-34.eu-west-3.compute.internal обсуждение исходный текст |
Ответ на | 035_standby_logical_decoding unbounded hang (Noah Misch <noah@leadboat.com>) |
Ответы |
Re: 035_standby_logical_decoding unbounded hang
|
Список | pgsql-hackers |
Hi, On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote: > Coincidentally, one of my buildfarm animals hanged several weeks in a > different test, 035_standby_logical_decoding.pl. A LOG_SNAPSHOT_INTERVAL_MS > reduction was part of making it reproducible: > > On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote: > > On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote: > > > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote: > > > > If you look at the buildfarm's failures page and filter down to > > > > just subscriptionCheck failures, what you find is that all of the > > > > last 6 such failures are in 031_column_list.pl: > > > > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com > > > reported a replacement BgWriterDelay value reproducing it. > > > > Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to > > BgWriterDelay. > > I'm reusing this thread just in case there's overlap with the > 031_column_list.pl cause and fix. The 035_standby_logical_decoding.pl hang is > a race condition arising from an event sequence like this: > > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU. > - Test script calls pg_log_standby_snapshot() on primary. Emits XLOG_RUNNING_XACTS. > - checkpoint_timeout makes a primary checkpoint finish. Emits XLOG_RUNNING_XACTS. > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits XLOG_RUNNING_XACTS. > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby. > > Other test code already has a solution for this, so the attached patches add a > timeout and copy the existing solution. I'm also attaching the hack that > makes it 100% reproducible. Thanks! I did a few tests and confirm that the proposed solution fixes the corner case. standby-slot-test-1-timeout-v1.patch LGTM. Regarding standby-slot-test-2-race-v1.patch: > +# See corresponding create_logical_slot_on_standby() code. > +$node_standby->poll_query_until( > + 'postgres', qq[ > + SELECT restart_lsn IS NOT NULL > + FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' > + ]) or die "timed out waiting for logical slot to calculate its restart_lsn"; > + What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm and then make use of it in create_logical_slot_on_standby() and above? (something like wait_for_restart_lsn_calculation-v1.patch attached). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
В списке pgsql-hackers по дате отправления: