RE: Excessive number of replication slots for 12->14 logical replication

From: houzj.fnst@fujitsu.com
Subject: RE: Excessive number of replication slots for 12->14 logical replication
Date:
Msg-id: OS0PR01MB5716ED883D44E2214F3563B094959@OS0PR01MB5716.jpnprd01.prod.outlook.com
In reply to: Re: Excessive number of replication slots for 12->14 logical replication  (Ajin Cherian <itsajin@gmail.com>)
Responses: Re: Excessive number of replication slots for 12->14 logical replication  (Ajin Cherian <itsajin@gmail.com>)
List: pgsql-bugs
On Sunday, July 24, 2022 4:17 PM Ajin Cherian <itsajin@gmail.com> wrote:
> On Sun, Jul 24, 2022 at 6:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jul 18, 2022 at 3:13 PM hubert depesz lubaczewski
> > <depesz@depesz.com> wrote:
> > >
> > > On Mon, Jul 18, 2022 at 09:07:35AM +0530, Amit Kapila wrote:
> > >
> > > First error:
> > > #v+
> > > 2022-07-18 09:22:07.046 UTC,,,4145917,,62d5263f.3f42fd,2,,2022-07-18
> > > 09:22:07 UTC,28/21641,1219146,ERROR,53400,"could not find free
> > > replication state slot for replication origin with OID
> > > 51",,"Increase max_replication_slots and try
> > > again.",,,,,,,"","logical replication worker",,0
> > > #v-
> > >
> > > Nothing else errored out before, no warning, no fatals.
> > >
> > > from the first ERROR I was getting them in the range of 40-70 per minute.
> > >
> > > At the same time I was logging data from `select now(), * from
> > > pg_replication_slots`, every 2 seconds.
> > >
> > ...
> > >
> > > So, it looks like there are up to 10 focal slots, all active, and then there are
> > > sync slots with weirdly high counts for inactive ones.
> > >
> > > At most, I had 11 active sync slots.
> > >
> > > Looks like some kind of timing issue, which would be in line with
> > > what Kyotaro Horiguchi wrote initially.
> > >
> >
> > I think this is a timing issue similar to what Horiguchi-San has
> > pointed out, but due to replication origins. We drop the replication
> > origin after the sync worker that used it has finished. This is done
> > by the apply worker, because we don't allow dropping an origin while
> > the process that owns it is still alive. I am not sure of the
> > repercussions, but maybe we could allow the owning process to drop
> > the origin itself.
> >
> 
> I have written a patch that drops the replication origin in the table
> sync worker itself. I had to reset the origin session (which also
> resets the owned-by flag) before dropping the origin.

Thanks for the patch.

I tried the patch and confirmed that, after applying it, we no longer get
the ERROR "could not find free replication state slot for replication
origin with OID".

I tested the patch by making the apply worker wait a bit longer after
setting the state to SUBREL_STATE_CATCHUP. Before the patch, the table
sync worker would exit before the apply worker dropped the replorigin, and
the apply worker would then try to start another worker, which caused the
ERROR.
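For readers following along: the failure mode can be sketched with a toy model (Python, purely illustrative; the names and numbers are assumptions, not server code). It treats the replication-origin state array as a fixed-size pool bounded by `max_replication_slots`, with each table sync acquiring one entry. Before the patch, the entry is freed by the apply worker some time after the sync worker exits; with the patch, the sync worker frees it itself before exiting, so the pool never fills up.

```python
# Toy model of replication-origin state slots (illustrative only).
MAX_SLOTS = 10  # stands in for max_replication_slots

def run_syncs(n_tables, drop_lag, max_slots=MAX_SLOTS):
    """Return the peak number of occupied origin slots, or None on 'error'.

    drop_lag: how many subsequent sync starts happen before the apply
    worker gets around to dropping a finished sync's origin.
    drop_lag == 0 models the patched behavior, where the sync worker
    drops its own origin on exit.
    """
    occupied = 0
    peak = 0
    pending_drops = []  # steps at which the apply worker frees an entry
    for step in range(n_tables):
        # apply worker finally drops origins whose lag has elapsed
        while pending_drops and pending_drops[0] <= step:
            pending_drops.pop(0)
            occupied -= 1
        if occupied >= max_slots:
            # "could not find free replication state slot ..."
            return None
        occupied += 1            # sync worker sets up its origin
        peak = max(peak, occupied)
        if drop_lag == 0:
            occupied -= 1        # patched: sync drops origin before exit
        else:
            pending_drops.append(step + drop_lag)
    return peak

print(run_syncs(100, drop_lag=15))  # unpatched, slow apply worker -> None
print(run_syncs(100, drop_lag=0))   # patched -> peak stays at 1
```

This matches what depesz observed: the error rate depends on how quickly new table syncs start relative to how quickly the apply worker gets around to the drops, i.e. a timing issue rather than a fixed leak.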

A few comments:

1)
-                 * There is a chance that the user is concurrently performing
-                 * refresh for the subscription where we remove the table
-                 * state and its origin and by this time the origin might be
-                 * already removed. So passing missing_ok = true.
-                 */

I think it would be better to move this comment to the new place where we
drop the replorigin.


2)

-                replorigin_drop_by_name(originname, true, false);
 
                 /*
                  * Update the state to READY only after the origin cleanup.

Should we slightly modify this comment, now that the origin drop code has
moved elsewhere? Maybe: "It's safe to update the state to READY as the
origin should have been dropped by the table sync worker".
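As a side note on why the session reset is needed before the drop (per Ajin's description above): an origin that is still owned by a live session cannot be dropped. A rough Python model of that constraint (illustrative only; the origin name and pid are made up, and this is not the server's actual data structure):

```python
# Toy model (not PostgreSQL internals) of origin ownership:
# an origin still owned by a session cannot be dropped.

class OriginStore:
    def __init__(self):
        self.origins = {}  # origin name -> owning pid (None = unowned)

    def create(self, name, owner):
        # session setup acquires ownership of the origin entry
        self.origins[name] = owner

    def reset_session(self, name):
        # analogue of resetting the origin session, clearing ownership
        self.origins[name] = None

    def drop(self, name, missing_ok=False):
        if name not in self.origins:
            if missing_ok:
                return  # mirrors passing missing_ok = true
            raise KeyError(name)
        if self.origins[name] is not None:
            # mirrors the server's refusal to drop an origin in use
            raise RuntimeError("origin is in use")
        del self.origins[name]

store = OriginStore()
store.create("pg_16395_16384", owner=4145917)  # hypothetical tablesync origin
try:
    store.drop("pg_16395_16384")        # fails: still owned by the session
except RuntimeError:
    pass
store.reset_session("pg_16395_16384")   # patched sync worker resets first...
store.drop("pg_16395_16384")            # ...then the drop succeeds
```

This is also why keeping `missing_ok = true` (comment 1) still matters: a concurrent refresh may have already removed the origin by the time the drop runs.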

Best regards,
Hou zj
