BUG #19396: Standby and DR site replication broken with PANIC: WAL contains references to invalid pages message
From:
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      19396
Logged by:          Ishan Joshi
Email address:      ishanjoshi@live.com
PostgreSQL version: 16.9
Operating system:   Ubuntu on Kubernetes

Description:

Hi Team,

I found an issue with a PostgreSQL 16.9 Patroni setup where replication to our standby node and to the disaster recovery (DR) site broke with the errors below. It looks like WAL corruption, which later made it into the archived WAL file as well.

CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117, off: 35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0: rel 1663/33195/410203483, blk 25329
PANIC: WAL contains references to invalid pages
CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117, off: 35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0: rel 1663/33195/410203483, blk 25329
WARNING: page 25329 of relation base/33195/410203483 does not exist
INFO: no action. I am (pg-patroni-node1-0), a secondary, and following a leader (pg-patroni-node2-0)
[61]LOG: terminating any other active server processes
[61]LOG: startup process (PID 72) was terminated by signal 6: Aborted
[61]LOG: shutting down due to startup process failure
[61]LOG: database system is shut down
INFO: establishing a new patroni heartbeat connection to postgres
INFO: Lock owner: pg-patroni-node2-0; I am pg-patroni-node1-0
WARNING: Retry got exception: connection problems
WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
INFO: Error communicating with PostgreSQL. Will try again later
WARNING: Postgresql is not running.

The primary database was not impacted, but replication to both the standby node and the DR site broke. I tried to reinitialize the standby from the latest pgBackRest backup plus archived WAL, but it failed with the same error once replay reached the corrupt WAL/archive file. In the end I had to reinitialize the 40 TB database with pg_basebackup, which took about 45 hours.

It looks like some race condition occurred while the WAL was generated; this looks like a potential bug.
On Mon, Feb 09, 2026 at 07:31:13AM +0000, PG Bug reporting form wrote:
> Primary db was not impacted, however standby node and DR site replication
> broken, I tried to reinit with latest backup + archive loading from
> pgbackrest backup but it fails with same error once the corrupt wal/archive
> file applying the changes. I had to reinit with pgbasebackup with 40TB
> database which took about 45 hrs of time.
>
> Looks like some RACE condition happend to WAL file that generate the issue.
> looks like potential bug of it.

Perhaps so. However, it is basically impossible to determine if this is actually an issue without more information. Hence, one would need more input about the workloads involved (concurrency included), the pages touched, and the WAL patterns at least. The best thing possible would be a reproducible self-contained test case, of course, which could be used to evaluate the versions impacted and the potential solutions. Race conditions like that with predefined WAL patterns should be easy to reproduce with some injection points to force a strict ordering of WAL record, particularly if this is a problem that can be reproduced after a startup, where we just need to make sure that a node is able to recover.

One thing that may matter, on top of my mind: does your backup setup rely on the in-core incremental backups with some combined backups? That could be a contributing factor, or not.
--
Michael
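As a first step toward the kind of WAL-pattern information requested above, the failing record can be located in the archive: the LSN from the CONTEXT line (184F3/F248B6F0) together with the timeline from the restored file name (0000002E) pins down one WAL segment. A minimal sketch of PostgreSQL's standard segment-name calculation, assuming the default 16 MB wal_segment_size (this is generic naming logic, not something specific to this cluster):

```python
# Map a PostgreSQL LSN (as printed in the logs, e.g. "184F3/F248B6F0")
# to the name of the WAL segment file that contains it.
# Assumes the default wal_segment_size of 16 MB.

WAL_SEG_SIZE = 16 * 1024 * 1024                  # 16 MB default segment size
SEGS_PER_XLOGID = 0x100000000 // WAL_SEG_SIZE    # 256 segments per "xlogid"

def wal_segment_name(timeline: int, lsn: str) -> str:
    """Return the 24-hex-digit WAL file name holding the given LSN."""
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    segno = ((hi << 32) | lo) // WAL_SEG_SIZE
    return f"{timeline:08X}{segno // SEGS_PER_XLOGID:08X}{segno % SEGS_PER_XLOGID:08X}"

# The PANIC occurred at 184F3/F248B6F0 on timeline 0x2E (46):
print(wal_segment_name(0x2E, "184F3/F248B6F0"))  # 0000002E000184F3000000F2
```

That segment could then be fetched from the repository (for instance via pgbackrest archive-get) and decoded with pg_waldump to see the records leading up to the failing one.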
Hi Michael,
We have pgBackRest configured to take incremental backups. We tried to recover the standby node from the latest incremental backup: the restore itself completed fine, but it failed while applying an archived WAL file, at the same point where the original issue occurred.
We have been running this setup in production for six months and this is the first time we have faced the issue; we never saw it in the performance-testing environment, where the same kinds of sessions/transactions were executed. It looks as if the WAL itself was generated incorrectly: a session tried to create a table, but the transaction failed with a broken-pipe error and was rolled back, so the table was never created in production. The WAL, however, seems to have missed logging the CREATE TABLE and the rollback, so some DML on that table was replayed on the standby and failed there as well as at the DR site.
Since the WAL itself has a wrong transaction sequence, the archived file has the same problem, and that is why recovery from backup also failed.
Error while recovering from backup:
2026-02-06 12:28:44.069 P00 INFO: archive-get command begin 2.55.1: [0000002E000184F300000007, pg_wal/RECOVERYXLOG] --archive-async --archive-timeout=600 --compress-level-network=3 --exec-id=33803-b2d6a506 --log-level-console=info --log-level-file=detail --pg2-host=pg-patroni --pg1-path=/var/lib/pgsql/data/postgresql_node1/ --pg2-path=/var/lib/pgsql/data/postgresql_node2/ --process-max=15 --repo1-path=/var/lib/pgbackrest --repo1-s3-bucket=prdbucket --repo1-s3-endpoint=https://10.150.13.38:5443 --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=<region_name> --repo1-s3-uri-style=path --no-repo1-storage-verify-tls --repo1-type=s3 --spool-path=/var/lib/pgsql/data/pgbackrest --stanza=patroni
2026-02-06 12:28:44.069 P00 INFO: found 0000002E000184F300000007 in the archive asynchronously
2026-02-06 12:28:44.073 P00 INFO: archive-get command end: completed successfully (6ms)
2026-02-06 19:28:44.076 +07 [18415]LOG: restored log file "0000002E000184F300000007" from archive
2026-02-06 12:28:44,438 INFO: no action. I am (pg-patroni-node1-0), a secondary, and following a leader (pg-patroni-node2-0)
2026-02-06 12:28:45,425 INFO: no action. I am (pg-patroni-node1-0), a secondary, and following a leader (pg-patroni-node2-0)
2026-02-06 12:28:46,428 INFO: no action. I am (pg-patroni-node1-0), a secondary, and following a leader (pg-patroni-node2-0)
2026-02-06 19:28:46.567 +07 [18415]WARNING: page 25329 of relation base/33195/410203483 does not exist
2026-02-06 19:28:46.567 +07 [18415]CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117, off: 35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0: rel 1663/33195/410203483, blk 25329
2026-02-06 19:28:46.567 +07 [18415]PANIC: WAL contains references to invalid pages
2026-02-06 19:28:46.567 +07 [18415]CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117, off: 35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0: rel 1663/33195/410203483, blk 25329
2026-02-06 19:28:46.669 +07 [18407]LOG: startup process (PID 18415) was terminated by signal 6: Aborted
2026-02-06 19:28:46.670 +07 [18407]LOG: terminating any other active server processes
2026-02-06 19:28:46.797 +07 [18407]LOG: shutting down due to startup process failure
2026-02-06 19:28:46.978 +07 [18407]LOG: database system is shut down
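One check that could help narrow down whether the page is genuinely missing or only missing at this point in replay (a suggested sketch, not something from the thread): verify whether block 25329 of that relation actually exists on the primary. Assuming the default 8 kB block size and 1 GB relation segment files, the block's position in the data file under base/33195/ works out as follows:

```python
# Locate a heap block on disk, given the path from the error message:
#   "page 25329 of relation base/33195/410203483 does not exist"
# Assumes the default block size (8 kB) and relation segment size (1 GB).

BLOCK_SIZE = 8192
BLOCKS_PER_SEGMENT = (1024 * 1024 * 1024) // BLOCK_SIZE  # 131072 blocks/segment

def block_location(relfilenode: int, blkno: int) -> tuple:
    """Return (file name, byte offset) of a block within base/<dboid>/."""
    segment, blk_in_seg = divmod(blkno, BLOCKS_PER_SEGMENT)
    name = str(relfilenode) if segment == 0 else f"{relfilenode}.{segment}"
    return name, blk_in_seg * BLOCK_SIZE

# Block 25329 of relfilenode 410203483 lives in the first segment file:
print(block_location(410203483, 25329))  # ('410203483', 207495168)
```

If the file 410203483 on the primary is at least 207,495,168 + 8,192 bytes long, the page exists there, which would point at a replay-ordering problem on the standby rather than a missing page on the primary.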
I am not sure we can reproduce this scenario easily, as it is a very strange and rare one to validate. If you have a test scenario to suggest, I can try it in a local environment.
Please let me know if any more information is required.
Thanks & Regards,
Ishan Joshi
From: Michael Paquier
Sent: Tuesday, February 10, 2026 06:04
To: ishanjoshi@live.com; pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #19396: Standby and DR site replication broken with PANIC: WAL contains references to invalid pages message
Hi Team,
Any update on this?
Can you suggest any steps to reproduce this kind of scenario? WAL is critical for recovery, and if the WAL file itself gets corrupted, it breaks both replication and recovery.
Thanks & Regards,
Ishan Joshi
From: Ishan Joshi <ishanjoshi@live.com>
Sent: 10 February 2026 09:52
To: Michael Paquier <michael@paquier.xyz>; pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>
Subject: Re: BUG #19396: Standby and DR site replication broken with PANIC: WAL contains references to invalid pages message