Thread: [MASSMAIL]Failing streaming replication on PostgreSQL 14

[MASSMAIL]Failing streaming replication on PostgreSQL 14

From
Nicolas Seinlet
Date:
Hello everyone,

Since I moved some clusters from PostgreSQL 12 to 14, I noticed random failures in streaming replication. I say
"random" mostly because I haven't got the source of the issue.

I'm using the Ubuntu/cyphered ZFS/PostgreSQL combination. I'm using Ubuntu LTS (20.04 22.04) and provided
ZFS/PostgreSQL with LTS (PostgreSQL 12 on Ubuntu 20.04 and 14 on 22.04).

The streaming replication of PostgreSQL is configured with `primary_conninfo 'host=main_server port=5432 user=replicant
password=a_very_secure_password sslmode=require application_name=replication_postgresql_app' ` , no replication slot nor
restore command, and the wal is configured with `full_page_writes = off wal_init_zero = off wal_recycle = off`

If this works like a charm on PostgreSQL 12, it's sometimes failing with PostgreSQL 14. As we also changed the OS,
maybe the issue relies somewhere else.

When the issue is detected, the WAL on the primary is correct. A piece of the WAL is wrong on the secondary. Only some
bytes. Some bytes later, the wal is again correct. Stopping PostgreSQL on the secondary, removing the wrong WAL file,
and restarting PostgreSQL solves the issue.

We've added another secondary and noticed the issue can appear on one of the secondaries, not both at the same time.

What can I do to detect the origin of this issue?

Have a nice week,

Nicolas.


Re: Failing streaming replication on PostgreSQL 14

From
Ron Johnson
Date:
On Mon, Apr 15, 2024 at 2:53 AM Nicolas Seinlet <nicolas@seinlet.com> wrote:
> Hello everyone,
>
> Since I moved some clusters from PostgreSQL 12 to 14, I noticed random failures in streaming replication. I say "random" mostly because I haven't got the source of the issue.
>
> I'm using the Ubuntu/cyphered ZFS/PostgreSQL combination. I'm using Ubuntu LTS (20.04 22.04) and provided ZFS/PostgreSQL with LTS (PostgreSQL 12 on Ubuntu 20.04 and 14 on 22.04).
>
> The streaming replication of PostgreSQL is configured with `primary_conninfo 'host=main_server port=5432 user=replicant password=a_very_secure_password sslmode=require application_name=replication_postgresql_app' ` , no replication slot nor restore command, and the wal is configured with `full_page_writes = off wal_init_zero = off wal_recycle = off`
>
> If this works like a charm on PostgreSQL 12, it's sometimes failing with PostgreSQL 14. As we also changed the OS, maybe the issue relies somewhere else.
>
> When the issue is detected, the WAL on the primary is correct. A piece of the WAL is wrong on the secondary. Only some bytes. Some bytes later, the wal is again correct. Stopping PostgreSQL on the secondary, removing the wrong WAL file, and restarting PostgreSQL solves the issue.
>
> We've added another secondary and noticed the issue can appear on one of the secondaries, not both at the same time.
>
> What can I do to detect the origin of this issue?

1. Minor version number?
2. Using replication_slots?
3. Error message(s)?


Re: Failing streaming replication on PostgreSQL 14

From
Nicolas Seinlet
Date:
On Monday, April 15th, 2024 at 14:36, Ron Johnson <ronljohnsonjr@gmail.com> wrote:

> On Mon, Apr 15, 2024 at 2:53 AM Nicolas Seinlet <nicolas@seinlet.com> wrote:
>
> > Hello everyone,
> >
> > Since I moved some clusters from PostgreSQL 12 to 14, I noticed random failures in streaming replication. I say "random" mostly because I haven't got the source of the issue.
> >
> > I'm using the Ubuntu/cyphered ZFS/PostgreSQL combination. I'm using Ubuntu LTS (20.04 22.04) and provided ZFS/PostgreSQL with LTS (PostgreSQL 12 on Ubuntu 20.04 and 14 on 22.04).
> >
> > The streaming replication of PostgreSQL is configured with `primary_conninfo 'host=main_server port=5432 user=replicant password=a_very_secure_password sslmode=require application_name=replication_postgresql_app' ` , no replication slot nor restore command, and the wal is configured with `full_page_writes = off wal_init_zero = off wal_recycle = off`
> >
> > If this works like a charm on PostgreSQL 12, it's sometimes failing with PostgreSQL 14. As we also changed the OS, maybe the issue relies somewhere else.
> >
> > When the issue is detected, the WAL on the primary is correct. A piece of the WAL is wrong on the secondary. Only some bytes. Some bytes later, the wal is again correct. Stopping PostgreSQL on the secondary, removing the wrong WAL file, and restarting PostgreSQL solves the issue.
> >
> > We've added another secondary and noticed the issue can appear on one of the secondaries, not both at the same time.
> >
> > What can I do to detect the origin of this issue?
>
> 1. Minor version number?
> 2. Using replication_slots?
> 3. Error message(s)?

Hi,


1.  PostgreSQL 14.11
2.  No, no replication slot nor restore command. As we've understood the replication slot, it's a mechanism to keep on
the primary side everything needed for the secondary to recover. Will this make the primary acknowledge that the
secondary received the good wal file?
3.  incorrect resource manager data checksum

Looking at the WAL files with xxd gives the following diff:

The bad one:
006c9160: 0a6e 7514 5030 2e31 0e35 016c 0f07 0009 2f62 6568 6100 7669 6f72 3a6e 6f72 be6d  .nu.P0.1.5.l..../beha.vior:nor.m
006c9180: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
006c91a0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
006c91c0: 437a 4263 7500 7273 6f72 3a70 6f69 0302 4503 9023 3237 3665 3720 323b 223e 5527  CzBcu.rsor:poi..E..#276e7 2;">U'

The good one contains the same 1st and 4th lines, but the 2nd and 3rd lines contain the correct values, as if a packet
was missed.
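
For anyone who wants to repeat this comparison without eyeballing xxd output, here is a minimal Python sketch. It assumes you have copied the same WAL segment from the primary and from the secondary to one machine; the function name `diff_ranges` and the command-line usage are made up for illustration and are not part of any PostgreSQL tooling. It lists each run of bytes that differs and says whether the secondary's copy of that run is entirely zeros, which is what the two middle lines of the dump above look like.

```python
import sys


def diff_ranges(good: bytes, bad: bytes):
    """Return (start, end, all_zero) for each run of differing bytes.

    `end` is exclusive. `all_zero` is True when every byte of the run
    in `bad` is 0x00. Only the common prefix is compared if the two
    inputs have different lengths.
    """
    n = min(len(good), len(bad))
    ranges = []
    start = None
    for i in range(n):
        if good[i] != bad[i]:
            if start is None:
                start = i  # a new differing run begins here
        elif start is not None:
            ranges.append((start, i, not any(bad[start:i])))
            start = None
    if start is not None:  # run extends to the end of the compared data
        ranges.append((start, n, not any(bad[start:n])))
    return ranges


def main(primary_path: str, secondary_path: str) -> None:
    # Hypothetical usage: both arguments are copies of the same segment.
    with open(primary_path, "rb") as f:
        good = f.read()
    with open(secondary_path, "rb") as f:
        bad = f.read()
    for start, end, zeroed in diff_ranges(good, bad):
        print(f"{start:#010x}-{end:#010x} ({end - start} bytes), "
              f"all zeros on secondary: {zeroed}")


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

The byte-by-byte loop is slow but simple; for a single 16 MB segment it should still finish in a few seconds. Knowing the exact offsets and lengths of the damaged runs may help to tell whether they line up with filesystem record or network buffer boundaries.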

Thanks for helping,

Nicolas.


Re: Failing streaming replication on PostgreSQL 14

From
Alvaro Herrera
Date:
On 2024-Apr-15, Nicolas Seinlet wrote:

> I'm using the Ubuntu/cyphered ZFS/PostgreSQL combination. I'm using
> Ubuntu LTS (20.04 22.04) and provided ZFS/PostgreSQL with LTS
> (PostgreSQL 12 on Ubuntu 20.04 and 14 on 22.04).

What exactly is "cyphered ZFS"?  Can you reproduce the problem with some
other filesystem?  If it's something very unusual, it might well be a
bug in the filesystem.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Find a bug in a program, and fix it, and the program will work today.
Show the program how to find and fix a bug, and the program
will work forever" (Oliver Silfridge)