Re: [BUG] recovery of prepared transactions during promotion can fail

From: Michael Paquier
Subject: Re: [BUG] recovery of prepared transactions during promotion can fail
Date:
Msg-id: ZJKHTPt2wIgDsUoi@paquier.xyz
In reply to: Re: [BUG] recovery of prepared transactions during promotion can fail  (Nathan Bossart <nathandbossart@gmail.com>)
Responses: Re: [BUG] recovery of prepared transactions during promotion can fail
List: pgsql-hackers
On Tue, Jun 20, 2023 at 09:33:45PM -0700, Nathan Bossart wrote:
> I've started seeing sporadic timeouts for 009_twophase.pl in cfbot, and I'm
> wondering if it's related to this change.
>
>     https://api.cirrus-ci.com/v1/task/4978271838797824/logs/test_world.log
>     https://api.cirrus-ci.com/v1/task/5477247717474304/logs/test_world.log
>     https://api.cirrus-ci.com/v1/task/5931749746671616/logs/test_world.log
>     https://api.cirrus-ci.com/v1/task/6353051175354368/logs/test_world.log
>     https://api.cirrus-ci.com/v1/task/5687888986243072/logs/test_world.log

Thanks for the poke; I had missed that.

The logs are enough to see what's happening here.  All the failing runs
stop making progress after this step:
[02:29:33.169] # Now paris is primary and london is standby
[02:29:33.169] ok 13 - Restore prepared transactions from records with primary down

Here are some log files:

https://api.cirrus-ci.com/v1/artifact/task/5477247717474304/testrun/build/testrun/recovery/009_twophase/log/009_twophase_london.log

https://api.cirrus-ci.com/v1/artifact/task/5477247717474304/testrun/build/testrun/recovery/009_twophase/log/009_twophase_paris.log

Just after that, the test restarts the previous primary as a standby:
# restart old primary as new standby
$cur_standby->enable_streaming($cur_primary);
$cur_standby->start;
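
For reference, here is roughly what this sequence amounts to with the
usual PostgreSQL::Test::Cluster helpers (a hand-rolled sketch, not the
test's actual code): enable_streaming() points primary_conninfo at the
other node and sets standby mode, so the next start brings the node up
as a streaming standby.

# Sketch only; assumes PostgreSQL::Test::Cluster.  enable_streaming()
# appends primary_conninfo and calls set_standby_mode(), which creates
# a standby.signal file.
$cur_standby->append_conf('postgresql.conf',
    "primary_conninfo = '" . $cur_primary->connstr . "'");
$cur_standby->set_standby_mode;    # creates standby.signal
$cur_standby->start;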

And the startup of this node gets stuck: at promotion, the promoted
node renamed the last segment of the old timeline with a .partial
suffix, but the restarting standby expects that segment to be available
via streaming.  From london, which is the new standby starting up:
2023-06-21 02:13:03.421 UTC [24652][walreceiver] LOG:  primary server
contains no more WAL on requested timeline 3
2023-06-21 02:13:03.421 UTC [24652][walreceiver] FATAL:  terminating
walreceiver process due to administrator command
2023-06-21 02:13:03.421 UTC [24647][startup] LOG:  new timeline 4
forked off current database system timeline 3 before current recovery
point 0/60000A0
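
To see the failure mode concretely, a hypothetical diagnostic fragment
(not part of the test, and assuming Test::More is loaded as usual in
TAP scripts) could check for the renamed segment on the promoted node;
the last, partially-filled segment of the old timeline gets a .partial
suffix in pg_wal at promotion, and a walsender cannot stream it:

# Hypothetical check, run just after paris' promotion.
my $wal_dir = $cur_primary->data_dir . '/pg_wal';
my @partial = glob "$wal_dir/*.partial";
note "partial segments on promoted node: @partial";
ok(scalar(@partial) > 0,
    'last segment of the old timeline was renamed to .partial');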

And paris logs the request it cannot serve:
2023-06-21 02:13:03.515 UTC [24661][walsender] [london][4/0:0] LOG:
received replication command: START_REPLICATION 0/6000000 TIMELINE 3
2023-06-21 02:13:03.515 UTC [24661][walsender] [london][4/0:0]
STATEMENT:  START_REPLICATION 0/6000000 TIMELINE 3

But that won't work, as the requested segment now exists only as a
.partial file in the primary's pg_wal, yet the standby still wants it.
Just restoring the archived segments won't help much, as the TAP
routines have nothing to handle partial segments yet, so I think that
it is better for now to just undo the has_archiving change in
009_twophase.pl, and tackle that coverage with a separate test, perhaps
only for HEAD.
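
Concretely, undoing that change would be a one-line tweak in the node
setup, along these lines (a sketch; the variable names and exact init()
options in 009_twophase.pl may differ):

# Before (with archiving):
#   $node_london->init(allows_streaming => 1, has_archiving => 1);
# After (pure streaming again):
$node_london->init(allows_streaming => 1);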
--
Michael
