Discussion: Startup PANIC on standby promotion due to zero-filled WAL segment

Startup PANIC on standby promotion due to zero-filled WAL segment

From: Alena Vinter
Date:
Hi hackers,

During replication, when a new timeline is detected, PostgreSQL creates a new zero-filled WAL segment on the new timeline instead of copying the partial segment from the previous timeline. This diverges from the behavior during timeline switches at startup.
This discrepancy can cause problems, especially when replication is slow. Consider the following scenario:

 last record in TLI |     | timeline switch point
                    v     v
|-----TLI N---------------|0000000000000000000
                          |
|-----TLI N+1--00000000000|0000000000000000000

If a standby is promoted before the WAL segment containing the last record of the previous timeline has been fully copied to the new timeline, startup may fail. We have observed this in production, where recovery aborts with "PANIC: invalid magic number 0000 in WAL segment ..."

I’ve attached:
* a patch and a TAP test that reproduce the issue;
* a draft patch that, on timeline switch during recovery, copies the remainder of the current WAL segment from the old timeline — matching the behavior used after crash recovery at startup.
All existing regression tests pass with the patch applied, but I plan to add more targeted test cases.
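
To make the scenario concrete without opening the attachment, here is a
rough sketch of the shape the test takes (node names, the cascading
layout, and the workload are illustrative; the attached test differs in
detail):

use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# TLI 1: primary streams to standby_1, which cascades to standby_2.
my $primary = PostgreSQL::Test::Cluster->new('primary');
$primary->init(allows_streaming => 1);
$primary->start;
$primary->backup('bkp');

my $standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
$standby_1->init_from_backup($primary, 'bkp', has_streaming => 1);
$standby_1->start;

$standby_1->backup('bkp2');
my $standby_2 = PostgreSQL::Test::Cluster->new('standby_2');
$standby_2->init_from_backup($standby_1, 'bkp2', has_streaming => 1);
$standby_2->start;

# Records on TLI 1, fully replicated down the chain before the switch.
$primary->safe_psql('postgres',
	'CREATE TABLE t AS SELECT generate_series(1, 1000) AS i');
$primary->wait_for_catchup($standby_1, 'replay', $primary->lsn('insert'));
$standby_1->wait_for_catchup($standby_2, 'replay', $standby_1->lsn('replay'));

# Mid-segment promotion: TLI 2 begins.  standby_2 detects the new
# timeline and zero-fills its copy of the current segment on TLI 2
# instead of copying the already-replicated part over from TLI 1.
$standby_1->promote;

# Promote standby_2 before the segment has been streamed on TLI 2:
# end-of-recovery reads the zero-filled part of the segment and fails
# with "invalid magic number".
$standby_2->promote;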

I’d appreciate your feedback. In particular:
* Is this behavior (not copying the segment during replication) intentional?
* Are there edge cases I might be overlooking?

---
Best wishes,
Alena Vinter
Attachments

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

From: Michael Paquier
Date:
On Tue, Dec 23, 2025 at 02:02:15PM +0700, Alena Vinter wrote:
> If a standby is promoted before the WAL segment containing the last record
> of the previous timeline has been fully copied to the new timeline, startup
> may fail. We have observed this in production, where recovery aborts with
> "PANIC: invalid magic number 0000 in WAL segment ..."
>
> I’ve attached:
> * a patch and a TAP test that reproduce the issue;
> * a draft patch that, on timeline switch during recovery, copies the
> remainder of the current WAL segment from the old timeline — matching the
> behavior used after crash recovery at startup.
> All existing regression tests pass with the patch applied, but I plan to
> add more targeted test cases.
>
> I’d appreciate your feedback. In particular:
> * Is this behavior (not copying the segment during replication) intentional?
> * Are there edge cases I might be overlooking?

The failure pattern is different in v18/master vs the rest of the
world.  With your test, v17 and older branches just wait at the end
for the standby node to start.  Anyway, the problem is the same as
far as I can see, with the test generating the following post-patch:
2025-12-23 17:08:37.494 JST startup[32689] LOG:  unexpected pageaddr
0/0305E000 in WAL segment 000000020000000000000003, LSN 0/03060000,
offset 393216
2025-12-23 17:08:37.494 JST startup[32689] FATAL:  according to
history file, WAL location 0/0305FFD0 belongs to timeline 1, but
previous recovered WAL file came from timeline 2

This would be right, because you are losing the records of the first
INSERT and TLI 1 diverges on the primary.  Now, the reason why you
are losing these records is the way the test is set up.  fsync is off
on the primary, hence you are forcing what looks like a corruption
scenario by promoting a node with some of its WAL records missing.  I
am unconvinced by the problem the way you are showing it.  This
primarily shows that setting fsync=off to force a divergence in
timelines is a bad idea: the records should be there, but the segment
contents are missing.

Perhaps it is a matter of proving your point in a cleaner way?  I am
open to your potential arguments, but I don't see a problem here
based on the test you are sending; I am just seeing something that
should not be done.

I am not questioning that you are seeing these failures in your
Postgres setups, but perhaps there is something in your HA flow that
you should not be doing, especially if it does the same things as
this test.  Just saying.
--
Michael

Attachments

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

From: Alena Vinter
Date:
Hi Michael,

Thanks for the review. To clarify: TLI 1 does not diverge; it is fully replicated to the standby before the timeline switch. The test then intentionally slows down replication on TLI 2 (e.g., by delaying WAL shipping), reproducing the scenario I illustrated. As far as I'm aware, `fsync` is `on` by default, and the test does not modify it, so no WAL records should be lost to unsafe flushing.

The core issue is that the segment on the new timeline is zero-initialized instead of being copied from the previous timeline (as is done at the end of crash recovery). As a result, startup cannot finish recovery: the non-replicated end of WAL causes failures like "invalid magic number".

---
Alena Vinter 

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

From: Michael Paquier
Date:
On Tue, Dec 23, 2025 at 04:33:30PM +0700, Alena Vinter wrote:
> Thanks for the review. To clarify: TLI 1 does not diverge; it is fully
> replicated to the standby before the timeline switch. The test then
> intentionally slows down replication on TLI 2 (e.g., by delaying WAL
> shipping), reproducing the scenario I illustrated. As far as I'm aware,
> `fsync` is `on` by default, and the test does not modify it, so no WAL
> records should be lost to unsafe flushing.

Don't think so, based on what is in the tree:
$ git grep "fsync = " -- *.pm
src/test/perl/PostgreSQL/Test/Cluster.pm:   print $conf "fsync = off\n";

> The core issue is that the segment on the new timeline is zero-initialized
> instead of being copied from the previous timeline (as is done at the end
> of crash recovery). As a result, startup cannot finish recovery: the
> non-replicated end of WAL causes failures like "invalid magic number".

The following addition to your proposed test is telling me an
entirely different story, making the test pass as the records of
TLI 1 are around:
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', 'fsync = on');
 $node_primary->start;
--
Michael

Attachments

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

From: Alena Vinter
Date:
> Don't think so, based on what is in the tree:
> $ git grep "fsync = " -- *.pm
> src/test/perl/PostgreSQL/Test/Cluster.pm:   print $conf "fsync = off\n";

Oh, I didn't know it was set in the `init` function; I apologize.

> The following addition to your proposed test is telling me an entirely
> different story, making the test pass as the records of TLI 1 are
> around:
> my $node_primary = PostgreSQL::Test::Cluster->new('primary');
> $node_primary->init(allows_streaming => 1);
> +$node_primary->append_conf('postgresql.conf', 'fsync = on');
> $node_primary->start;

I've tried it this way, and yes, it works fine. Now I'm really interested in how this parameter prevents the startup PANIC scenario. Thank you very much!
But it's still unclear to me why the segment isn't copied during replication, as it is in crash recovery (I'd prefer uniform behavior across both paths). Could you help me figure that out?
---
Alena Vinter

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

From: Alena Vinter
Date:
Michael, I left my pipeline running the TAP test in a loop until it failed, and after some time it did. I then changed the test slightly: simply by adding a short sleep, I was able to reproduce the same failure more reliably. Moreover, attempting to restart the standby server after a failed promotion triggers the startup PANIC again.
Attachments

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

From: Michael Paquier
Date:
On Tue, Dec 23, 2025 at 08:49:20PM +0700, Alena Vinter wrote:
> Michael, I left my pipeline running the TAP test in a loop until it
> failed, and after some time it did. I then changed the test slightly:
> simply by adding a short sleep, I was able to reproduce the same failure
> more reliably. Moreover, attempting to restart the standby server after
> a failed promotion triggers the startup PANIC again.

This is a better argument, yes.  ProcessInterrupts() is just a way to
force the WAL receiver to do nothing.  We could see the same if a WAL
receiver repeatedly fails a palloc() or some other allocation,
shutting it down before it is able to stream any changes, and we
could also have a test with an injection point that forces an error
based on a specific timeline number, or something like that.
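
Roughly like this with the injection_points module, for instance (the
point name here is invented; it would require a matching
INJECTION_POINT() call in the walreceiver code first):

# Create the extension on the primary and let it replicate, then attach
# the (hypothetical) point on the standby so that the walreceiver errors
# out instead of streaming anything from the new timeline.
$node_primary->safe_psql('postgres', 'CREATE EXTENSION injection_points');
$node_primary->wait_for_catchup($node_standby);
$node_standby->safe_psql('postgres',
	"SELECT injection_points_attach('walreceiver-new-timeline', 'error')");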

Hmm.  Like in the case where the WAL receiver is not able to connect
to a primary, shouldn't we prevent the promotion request from being
processed at all?  So while you have your finger on something here, I
don't think that your suggested solution is a good or a correct one:
it sounds to me that the startup process assumes that the WAL
receiver is doing some work, then the promotion request comes in and
we assume that it is OK to go through with the promotion while we
should obviously not do so, because the WAL receiver has streamed
zero contents from TLI 2.  It sounds to me that we should let the
startup process know that something is wrong with the WAL receiver,
meaning that it may be up to the WAL receiver to save some
information in shared memory so that the startup process does not
allow the promotion to go through at all.
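
If we had such a guard, a test could sketch the expected behavior
like this (hypothetical: today pg_promote() would merely report a
timeout, as no refusal logic exists yet):

# With the WAL receiver known to have streamed nothing from TLI 2, a
# promotion request should not complete; pg_promote() reports that by
# returning false after wait_seconds.
my $promoted = $node_standby->safe_psql('postgres',
	'SELECT pg_promote(true, 5)');
is($promoted, 'f', 'promotion refused while the WAL receiver is stuck');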
--
Michael

Attachments

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

From: Alena Vinter
Date:
I like the idea of preventing promotion to avoid such failures -- it sounds reasonable.

However, we still have a problem: if the standby is stopped while TLI 2 has not been fully replicated, it will fail to start with
"FATAL: according to history file, WAL location Y belongs to timeline X, but previous recovered WAL file came from timeline X+1".
This happens even if no promotion is attempted, just a plain restart of the standby. So the issue isn't only about when to allow promotion.
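
In TAP terms the restart case is just this (node names as in the
sketch upthread; fail_ok keeps the test alive so the failed start can
be asserted):

# Stop the standby while its current segment on TLI 2 is still partly
# zero-filled, then attempt a plain restart, with no promotion involved.
$standby_2->stop;
my $started = $standby_2->start(fail_ok => 1);
is($started, 0, 'standby fails to restart with partially-replicated TLI 2');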

Regarding my proposed solution: could you clarify why it isn’t correct? I’d appreciate more detail so I can address your concerns.

---
Alena Vinter
Attachments