Thread: Logical replication breaks: "unexpected duplicate for tablespace 0, relfilenode 2774069304"

Logical replication breaks: "unexpected duplicate for tablespace 0, relfilenode 2774069304"

From
Kouber Saparev
Date:
We are using logical replication in a quite busy environment. On the publisher side, temporary tables are created and dropped all the time (due to some Zend Entity Framework extension "optimisation"), which heavily bloats the system catalogs, among other things.

At some point all our logical replication subscribers / replication slots drop, because of an error:

"could not receive data from WAL stream: ERROR:  unexpected duplicate for tablespace 0, relfilenode 2774069304"

The table for this file node is not even included in any of the publications we have. I've found a similar issue described [1] before, so I was wondering whether this patch is applied? Our subscriber database is PostgreSQL 16.1 and the publisher - PostgreSQL 15.4.
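
To see which relation(s) the reported file node maps to, one could inspect pg_class on the publisher. The following is only a minimal sketch: it assumes the error corresponds to more than one pg_class row sharing the same (reltablespace, relfilenode) pair, as its wording suggests. The helper name group_duplicates and the sample rows are hypothetical; only the pg_class columns and the relfilenode value come from PostgreSQL and from this thread.

```python
from collections import defaultdict

# Catalog query to run on the publisher with any SQL client.
# The relfilenode value is the one from the error message in this thread.
DUPLICATE_QUERY = """
SELECT reltablespace, relfilenode, oid::regclass AS relation, relpersistence
FROM pg_class
WHERE relfilenode = '2774069304'::oid
ORDER BY oid;
"""


def group_duplicates(rows):
    """Group (reltablespace, relfilenode, relation, relpersistence) rows and
    return only the (reltablespace, relfilenode) keys shared by 2+ relations."""
    groups = defaultdict(list)
    for reltablespace, relfilenode, relation, relpersistence in rows:
        groups[(reltablespace, relfilenode)].append((relation, relpersistence))
    return {key: rels for key, rels in groups.items() if len(rels) > 1}


if __name__ == "__main__":
    # Hypothetical result set, for illustration only (not real output).
    sample_rows = [
        (0, 2774069304, "pg_temp_3.tmp_entities", "t"),
        (0, 2774069304, "public.some_table", "p"),
        (0, 16384, "public.other_table", "p"),
    ]
    for key, rels in group_duplicates(sample_rows).items():
        print(key, rels)
```

In the result, relpersistence is 't' for temporary relations and 'p' for permanent ones, which shows whether a temporary table is one of the relations sharing the file node.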

What quick solution would fix the replication? Repack of the table? Reload of the database? Killing some backends?

We rely heavily on this feature in a production environment and cannot just leave the subscriber side out of sync.

Regards,
--
Kouber Saparev

[1] 

Re: Logical replication breaks: "unexpected duplicate for tablespace 0, relfilenode 2774069304"

From
Michael Paquier
Date:
On Fri, Dec 22, 2023 at 10:55:24AM +0200, Kouber Saparev wrote:
> The table for this file node is not even included in any of the
> publications we have. I've found a similar issue described [1] before, so I
> was wondering whether this patch is applied? Our subscriber database is
> PostgreSQL 16.1 and the publisher - PostgreSQL 15.4.

Or just this link using the community archives based on the
message-ID:
https://www.postgresql.org/message-id/TYCPR01MB83731ADE7FD7C7CF5D335BCEEDE99@TYCPR01MB8373.jpnprd01.prod.outlook.com

> What quick solution would fix the replication? Repack of the table? Reload
> of the database? Killing some backends?

There may be something you could do as a short-term solution, but it
does not solve the actual root of the problem, because the error you
are seeing is not something users should be able to face.

The first problem that we have here is that we've lost track of the
patch proposed, so I have added a CF entry for now:
https://commitfest.postgresql.org/46/4720/
--
Michael

Re: Logical replication breaks: "unexpected duplicate for tablespace 0, relfilenode 2774069304"

From
Kouber Saparev
Date:

On Sun, 24.12.2023 at 3:37, Michael Paquier <michael@paquier.xyz> wrote:
>> What quick solution would fix the replication? Repack of the table? Reload
>> of the database? Killing some backends?
>
> There may be something you could do as a short-term solution, but it
> does not solve the actual root of the problem, because the error you
> are seeing is not something users should be able to face.

We need an action plan for when this happens again (which might be in the middle of the night etc.), i.e. how to rebuild and resync our logically replicated tables. Simply re-enabling the subscription does not work: the same error reappears, so we have to drop all the slots, recreate them and deal with the syncing stuff. If a repack (or something else) on the publisher side allowed us to re-enable the subscription easily, without dropping the slots, then for the moment it would save us from this prolonged desync/downtime situation.

> The first problem that we have here is that we've lost track of the
> patch proposed, so I have added a CF entry for now:
> https://commitfest.postgresql.org/46/4720/

Thank you. Is there a bug report or should we file one? It looks like something that compromises the reliability of the logical replication as a whole. 

--
Kouber
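
Re: Logical replication breaks: "unexpected duplicate for tablespace 0, relfilenode 2774069304"

From
Michael Paquier
Date:

On Thu, Dec 28, 2023 at 02:03:12PM +0200, Kouber Saparev wrote: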
>> The first problem that we have here is that we've lost track of the
>> patch proposed, so I have added a CF entry for now:
>> https://commitfest.postgresql.org/46/4720/
>
> Thank you. Is there a bug report or should we file one? It looks like
> something that compromises the reliability of the logical replication as a
> whole.

Having a CF entry means that it is already tracked, so no need to do
more here for the moment.  The next step would be to look at the
proposed patch in more detail, and work on fixing the issue.
--
Michael
