Re: Implement waiting for wal lsn replay: reloaded

From Xuneng Zhou
Subject Re: Implement waiting for wal lsn replay: reloaded
Msg-id CABPTF7WN_3kDPBYPxaKKcp2kO5BLB5bK_YGz70VTzTCivHibZA@mail.gmail.com
In reply to Re: Implement waiting for wal lsn replay: reloaded  (Andres Freund <andres@anarazel.de>)
Responses Re: Implement waiting for wal lsn replay: reloaded
List pgsql-hackers
Hi,

On Wed, Jan 7, 2026 at 8:32 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2026-01-06 18:42:59 +1300, Thomas Munro wrote:
> > Could this be causing the recent flapping failures on CI/macOS in
> > recovery/031_recovery_conflict?  I didn't have time to dig personally
> > but f30848cb looks relevant:
> >
> > Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
> > error running SQL: 'psql:<stdin>:1: ERROR:  canceling statement due to
> > conflict with recovery
> > DETAIL:  User was or might have been using tablespace that must be dropped.'
> > while running 'psql --no-psqlrc --no-align --tuples-only --quiet
> > --dbname port=25195
> > host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
> > dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
> > FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
> > no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
> > line 2300.
> >
> > https://cirrus-ci.com/task/5771274900733952
> >
> > The master branch in time-descending order, macOS tasks only:
> >
> >      task_id      | substring |  status
> > ------------------+-----------+-----------
> >  6460882231754752 | c970bdc0  | FAILED
> >  5771274900733952 | 6ca8506e  | FAILED
> >  6217757068361728 | 63ed3bc7  | FAILED
> >  5980650261446656 | ae283736  | FAILED
> >  6585898394976256 | 5f13999a  | COMPLETED
> >  4527474786172928 | 7f9acc9b  | COMPLETED
> >  4826100842364928 | e8d4e94a  | COMPLETED
> >  4540563027918848 | b9ee5f2d  | FAILED
> >  6358528648019968 | c5af141c  | FAILED
> >  5998005284765696 | e212a0f8  | COMPLETED
> >  6488580526178304 | b85d5dc0  | FAILED
> >  5034091344560128 | 7dc95cc3  | ABORTED
> >  5688692477526016 | bb048e31  | COMPLETED
> >  5481187977723904 | d351063e  | COMPLETED
> >  5101831568752640 | f30848cb  | COMPLETED <-- the change
> >  6395317408497664 | 3f33b63d  | COMPLETED
> >  6741325208354816 | 877ae5db  | COMPLETED
> >  4594007789010944 | de746e0d  | COMPLETED
> >  6497208998035456 | 461b8cc9  | COMPLETED
>
> The failure rates of this are very high - the majority of the CI runs on the
> postgres/postgres repos failed since the change went in. Which then also means
> cfbot has a very high spurious failure rate. I think we need to revert this
> change until the problem has been verified as fixed.

This specific failure can be reproduced with the attached v1 patch.

I suspect the race condition is this: wait_for_replay_catchup() runs
WAIT FOR LSN on the standby, and if a tablespace conflict fires during
that wait, the WAIT FOR LSN session is canceled even though it doesn't
use the tablespace.
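
To make the suspected sequence concrete, here is a toy model in
Python. It is not PostgreSQL code: the Session class and
resolve_tablespace_conflict() are hypothetical names, and the only
behavior it assumes is what the DETAIL line above suggests ("was or
might have been using"), namely that the conflict cancels every
session with a statement in progress without checking which ones
actually touch the tablespace.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical stand-in for a backend connected to the standby."""
    name: str
    uses_tablespace: bool
    in_statement: bool = True
    cancelled: bool = False


def resolve_tablespace_conflict(sessions):
    """Toy conflict resolution: cancel every session running a statement.

    Assumption: uses_tablespace is deliberately never consulted, which
    models a standby that cannot tell who depends on the tablespace.
    """
    for s in sessions:
        if s.in_statement:
            s.cancelled = True
    return [s.name for s in sessions if s.cancelled]


sessions = [
    Session("test query using the tablespace", uses_tablespace=True),
    Session("WAIT FOR LSN from wait_for_replay_catchup()",
            uses_tablespace=False),
]

cancelled = resolve_tablespace_conflict(sessions)
print(cancelled)
```

In this model the waiting session is cancelled along with the one
that really uses the tablespace, which matches the error the test
harness reported.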

In my testing, the failure no longer occurs after applying the v2 patch.

--
Best,
Xuneng
