Re: Startup PANIC on standby promotion due to zero-filled WAL segment

From: Michael Paquier
Subject: Re: Startup PANIC on standby promotion due to zero-filled WAL segment
Msg-id: aUplOdaM4kGYd4t3@paquier.xyz
In response to: Re: Startup PANIC on standby promotion due to zero-filled WAL segment (Alena Vinter <dlaaren8@gmail.com>)
Responses: Re: Startup PANIC on standby promotion due to zero-filled WAL segment
List: pgsql-hackers

On Tue, Dec 23, 2025 at 04:33:30PM +0700, Alena Vinter wrote:
> Thanks for the review. To clarify: TLI 1 does not diverge — it is fully
> replicated to the standby before the timeline switch. The test then
> intentionally slows down replication on TLI 2 (e.g., by delaying WAL
> shipping), reproducing the scenario I illustrated. As far as I’m aware,
> `fsync` is `on` by default, and the test does not modify it — so no WAL
> records are lost due to unsafe flushing.

Don't think so, based on what is in the tree:
$ git grep "fsync = " -- *.pm
src/test/perl/PostgreSQL/Test/Cluster.pm:   print $conf "fsync = off\n";

> The core issue is that the new timeline’s segment is zero-initialized
> instead of copying the same segment from the previous timeline (as done in
> crash-recovery startup).  As a result, startup cannot finish recovery due
> to non-replicated end of WAL causing failures like “invalid magic number”.

The following addition to your proposed test is telling me an entirely
different story, making the test pass as the records of TLI 1 are
around:
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
+#$node_primary->append_conf('postgresql.conf', 'fsync=on');
 $node_primary->start;
--
Michael
