Orphaned records in pg_replication_origin_status after subscription drop

From: Michael Paquier

Hi all,
(Added Amit K. in CC.)

Some colleagues have reported that it is possible to end up with
orphaned records in pg_replication_origin_status, as an effect of
table synchronization workers missing some cleanup actions around
replorigin_drop_by_name() when a subscription that spawned these
workers (which themselves create origins through replorigin_create()
called in LogicalRepSyncTableStart()) is dropped.

Please find attached a small script, courtesy of Tenglong Gu and
Daisuke Higuchi, who have been digging into all this.

This problem has existed since ce0fdbfe9722, which added the creation
of a replication origin for table synchronization workers, affecting
14 up to HEAD.  One issue with these origin states is that they are
sticky: a restart of the node that held the subscription keeps them
around because they get flushed into replorigin_checkpoint.  If my
understanding is right, this also means that the origin slots are
still tracked as in use, implying that replorigin_session_setup()
would not be able to reuse them.

So, in summary (please correct me if necessary), DropSubscription()
attempts a round of cleanup for all the replication origins with a
loop over ReplicationOriginNameForLogicalRep(), and while we do
attempt to drop the origins created by the workers that end up
tracked as orphaned, the syscache lookup of REPLORIGIDENT fails
within the DROP SUBSCRIPTION because replorigin_by_name() cannot find
an entry: the replication origin is already gone, but its state has
persisted in memory.  So, why would the replication origin be already
gone with its state in memory not cleaned up yet?  Well, the
problematic test case shows that the subscription is dropped while
the spawned tablesync workers do the initial table copy, and the
replication origin is created in the same transaction as the COPY.
DROP SUBSCRIPTION tells the tablesync workers to stop, they stop with
the COPY failing, and the origin is not seen as something that exists
by the session that drops the subscription.  The replication state in
memory goes out of sync.

So it looks to me like we are missing a code path where the
replication origin is dropped but we neglect to reset the replication
state in memory, leading to bloat of the available origin slots.
worker.c does things differently: a small transaction is used for the
origin creation, hence we would never really see the replication
state in memory going out of sync with the replorigin catalog.

One solution would be for tablesync.c to do the same as worker.c: we
could use a short transaction that sets the memory state and creates
the replication origin, then move to a second transaction for the
COPY.  A second solution, slightly more complicated, is to create some
specific logic to reset the progress state of the origin that's been
created in the transaction that runs the initial COPY, which I guess
should be in the shape of a transaction abort callback.  All these
symptoms point out that it is not a good idea, IMO, to expect a
replication origin to exist when dropping a subscription if the
origin has been created in the context of a transaction that can take
a very long time to run, aka the initial tablesync COPY, so using a
short transaction feels like a good thing to do here.
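
In rough terms, the first option could look like this in
LogicalRepSyncTableStart() (just an untested sketch, with simplified
signatures and declarations, not a patch):

    char        originname[NAMEDATALEN];
    RepOriginId originid;

    /*
     * Short transaction: create the origin and set up the in-memory
     * session state, the same way worker.c does it.
     */
    StartTransactionCommand();
    ReplicationOriginNameForLogicalRep(MySubscription->oid,
                                       MyLogicalRepWorker->relid,
                                       originname, sizeof(originname));
    originid = replorigin_create(originname);
    replorigin_session_setup(originid, 0);
    CommitTransactionCommand();

    /* Second transaction: run the potentially long initial COPY. */
    StartTransactionCommand();
    copy_table(rel);
    CommitTransactionCommand();

This way, if a DROP SUBSCRIPTION kicks in while the COPY runs, the
origin and its in-memory state have been set up consistently, and the
existing drop path can clean up both.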

I have to admit that this code has become much more complicated over
the last couple of years, hence I may not have the full context of
how these cleanup actions should be achieved.  They clearly should
happen, though I am not completely sure which solution would be
better than the other.  There may be an entirely different solution,
as well.

Note that it would not be complicated to create a regression test
based on an injection point: making a tablesync worker wait during
its COPY while issuing a DROP SUBSCRIPTION would be enough.
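
For example, with an injection point placed just before the COPY
starts (the point name is invented here, and the exact macro details
are omitted):

    /* In the tablesync worker, before starting the initial COPY */
    INJECTION_POINT("tablesync-before-copy");

A TAP test could then attach a "wait" callback to that point, issue
the DROP SUBSCRIPTION while the worker is stuck there, and check that
no orphaned origin state remains afterwards.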

So, thoughts or comments?
--
Michael

Attachments

Re: Orphaned records in pg_replication_origin_status after subscription drop

From: Amit Kapila

On Fri, Dec 19, 2025 at 10:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> [...]
>
> One solution would be for tablesync.c to do the same as worker.c: we
> could use a short transaction that sets the memory state and creates
> the replication origin, then move to a second transaction for the
> COPY.  A second solution, slightly more complicated, is to create some
> specific logic to reset the progress state of the origin that's been
> created in the transaction that runs the initial COPY, which I guess
> should be in the shape of a transaction abort callback.
>

Yeah, either of these ways would be okay.  The only minor downside of
creating the replication origin in a separate transaction is that the
other operations on the origin would still need to be done at the
current place, so the origin handling would look a bit scattered.  So,
I have a slight preference for the PG_ENSURE_ERROR_CLEANUP solution,
where the callback would clear the origin state via
replorigin_state_clear().
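
Roughly something like this (untested; the callback name is invented
for illustration, and replorigin_state_clear() would need to be
exposed outside origin.c first):

    /*
     * Abort callback: reset the in-memory origin state if the
     * transaction running the initial COPY fails.
     */
    static void
    tablesync_origin_cleanup(int code, Datum arg)
    {
        /* would call the currently-static replorigin_state_clear() */
    }

    ...
    PG_ENSURE_ERROR_CLEANUP(tablesync_origin_cleanup, (Datum) 0);
    {
        copy_table(rel);
    }
    PG_END_ENSURE_ERROR_CLEANUP(tablesync_origin_cleanup, (Datum) 0);

That would keep all the origin handling in tablesync.c at its current
place.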

--
With Regards,
Amit Kapila.



Re: Orphaned records in pg_replication_origin_status after subscription drop

From: Masahiko Sawada

On Fri, Dec 19, 2025 at 2:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 19, 2025 at 10:42 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > [...]
> >
> > So it looks to me like we are missing a code path where the
> > replication origin is dropped but we neglect to reset the
> > replication state in memory, leading to bloat of the available
> > origin slots.  worker.c does things differently: a small
> > transaction is used for the origin creation, hence we would never
> > really see the replication state in memory going out of sync with
> > the replorigin catalog.

IIUC, a similar issue also happens when the tablesync worker simply
errors out for some reason (e.g., conflicts), but in this case the
tablesync worker would retry from scratch.  While the replication
origin record is deleted during the rollback, the origin slot is
still in use but not acquired by anyone.  After the tablesync worker
restarts, it would get a new origin ID and might reuse the orphaned
slot if the origin ID matches.

> >
> > One solution would be for tablesync.c to do the same as worker.c: we
> > could use a short transaction that sets the memory state and creates
> > the replication origin, then move to a second transaction for the
> > COPY.

Also, for these error cases, the tablesync worker should accept that
the replication origin may already exist instead of raising an
error:

 ereport(ERROR,
         (errcode(ERRCODE_DUPLICATE_OBJECT),
          errmsg("replication origin \"%s\" already exists",
                 originname)));
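
Roughly (an untested sketch):

    /*
     * Tolerate an origin left over from a failed previous attempt,
     * instead of erroring out on the retry.
     */
    originid = replorigin_by_name(originname, true);    /* missing_ok */
    if (originid == InvalidRepOriginId)
        originid = replorigin_create(originname);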

> > A second solution, slightly more complicated, is to create some
> > specific logic to reset the progress state of the origin that's been
> > created in the transaction that runs the initial COPY, which I guess
> > should be in the shape of a transaction abort callback.

I think that's too much of a change for backpatching.

> So, I have a slight
> preference for the PG_ENSURE_ERROR_CLEANUP solution, where the
> callback would clear the origin state via replorigin_state_clear().

That would work too.  Or I think we could do a similar thing in
replorigin_reset() for tablesync workers that are in
SUBREL_STATE_DATASYNC state.  Both ways require exposing
replorigin_state_clear(), though.

I am slightly leaning towards the idea of using a short transaction,
because the tablesync worker would then do things closer to the apply
worker, and it would also fix the odd behavior where
pg_replication_origin_status shows NULL in the external_id column for
these origins while the COPY is being executed.  But we need to
verify that it's really okay to reuse the existing origin instead of
raising an error in the case where the tablesync worker retries the
data copy.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Orphaned records in pg_replication_origin_status after subscription drop

From: Michael Paquier

On Fri, Dec 19, 2025 at 01:56:38PM -0800, Masahiko Sawada wrote:
> That would work too.  Or I think we could do a similar thing in
> replorigin_reset() for tablesync workers that are in
> SUBREL_STATE_DATASYNC state.  Both ways require exposing
> replorigin_state_clear(), though.

Exposing replorigin_state_clear() feels a bit weird here for
something that has been reserved to origin.c until now, though it
would still be OK.

> I am slightly leaning towards the idea of using a short transaction,
> because the tablesync worker would then do things closer to the apply
> worker, and it would also fix the odd behavior where
> pg_replication_origin_status shows NULL in the external_id column for
> these origins while the COPY is being executed.  But we need to
> verify that it's really okay to reuse the existing origin instead of
> raising an error in the case where the tablesync worker retries the
> data copy.

I also lean towards more consistency between the worker and tablesync
code, not less.  Hence I'd prefer a short transaction at the beginning
of the tablesync startup, so that we can just rely on the DROP
SUBSCRIPTION code path to ensure that the origin is removed from the
catalogs and cleaned up in memory.

The case of pg_replication_origin_status showing NULL while the COPY
is executing is an interesting point; I didn't think about it
yesterday, but yes, that seems like a good thing to fix anyway.
--
Michael

Attachments