Обсуждение: Runaway Initial Table Syncs in Logical Replication?

Поиск
Список
Период
Сортировка

Runaway Initial Table Syncs in Logical Replication?

От
Don Seiler
Дата:
Logical Rep question. Publisher is PG12 on Ubuntu 18.04, subscriber is PG15 on Ubuntu 22.04.

I bumped up some values to see how initial load times change. I set max_logical_replication_workers to 20 and max_sync_workers_per_subscription to 4. I’m using 3 subscriptions, 2 of the subscriptions have 3 tables each, the other has a lot more, say 100 for the sake of example.What we noticed was that the subscriber basically maxed out the wal senders and replication slots on the publisher side, even when the publisher settings were bumped up to 50 each. 3 replication slots were for the 3 subs and the rest (47) were sync workers. Was creating a sync worker for each table right away, or at least trying to? The subscriber side was still complaining that it couldn’t create more replication slots on the publisher. 

I was expecting to see max_sync_workers_per_subscription (4) x # of subs (3) = 12 sync workers in action so this was a big surprise. Is this expected behavior?

--
Don Seiler
www.seiler.us

RE: Runaway Initial Table Syncs in Logical Replication?

От
"Zhijie Hou (Fujitsu)"
Дата:
On Friday, August 4, 2023 3:24 AM Don Seiler <don@seiler.us>  wrote:

Hi,

Thanks for reporting.

> Logical Rep question. Publisher is PG12 on Ubuntu 18.04, subscriber is PG15 on
> Ubuntu 22.04.

Just to confirm, which subversion of PG15 are you using on the subscriber side ?
Is it 15.3 ? I am confirming because I recall we fixed similar bug before in commit [1].

> 
> I bumped up some values to see how initial load times change. I set
> max_logical_replication_workers to 20 and max_sync_workers_per_subscription to
> 4. I’m using 3 subscriptions, 2 of the subscriptions have 3 tables each, the
> other has a lot more, say 100 for the sake of example.What we noticed was that
> the subscriber basically maxed out the wal senders and replication slots on the
> publisher side, even when the publisher settings were bumped up to 50 each. 3
> replication slots were for the 3 subs and the rest (47) were sync workers. Was
> creating a sync worker for each table right away, or at least trying to? The
> subscriber side was still complaining that it couldn’t create more replication
> slots on the publisher. 

Are there any ERRORs(including erros other than the slots number) in the log of publisher and subscriber.
It would be helpful to see the log in case there are some other ERRORs that could cause problem.

> 
> I was expecting to see max_sync_workers_per_subscription (4) x # of subs (3) =
> 12 sync workers in action so this was a big surprise. Is this expected
> behavior?


Best Regards,
Hou zj

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f6c5edb8abcac04eb3eac6da356e59d399b2bcef

Re: Runaway Initial Table Syncs in Logical Replication?

От
Don Seiler
Дата:
On Thu, Aug 3, 2023 at 9:12 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:

Just to confirm, which subversion of PG15 are you using on the subscriber side ?
Is it 15.3 ? I am confirming because I recall we fixed similar bug before in commit [1].

Yes. Specifically on the target/subscriber:

# select version();
                                                              version
-----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 15.3 (Ubuntu 15.3-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
(1 row)


Installed on Ubuntu 22.04 via the PGDG apt repo. 

The source/publisher side is:

# select version();
                                                              version
-----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 12.15 (Ubuntu 12.15-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
(1 row)


Are there any ERRORs(including erros other than the slots number) in the log of publisher and subscriber.
It would be helpful to see the log in case there are some other ERRORs that could cause problem.

On the subscriber I see a series of errors similar to this:

2023-08-03 18:22:50.196 UTC [239889] LOG:  logical replication table synchronization worker for subscription "sub01", table "users" has started
2023-08-03 18:22:50.209 UTC [239889] ERROR:  could not create replication slot "pg_19742_sync_18602_7263122209699118815": ERROR:  all replication slots are in use
        HINT:  Free one or increase max_replication_slots.

on the publisher I see this corresponding message:

2023-08-03 18:22:50.199 UTC [13884] [unknown]@[unknown] - [unknown] LOG:  connection received: host=10.228.88.74 port=47338
2023-08-03 18:22:50.207 UTC [13884] migrator@foo - [unknown] LOG:  replication connection authorized: user=migrator application_name=pg_19742_sync_18602_7263122209699118815 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
2023-08-03 18:22:50.209 UTC [13884] migrator@foo - pg_19742_sync_18602_7263122209699118815 ERROR:  all replication slots are in use
2023-08-03 18:22:50.209 UTC [13884] migrator@foo - pg_19742_sync_18602_7263122209699118815 HINT:  Free one or increase max_replication_slots.
2023-08-03 18:22:50.209 UTC [13884] migrator@foo - pg_19742_sync_18602_7263122209699118815 STATEMENT:  CREATE_REPLICATION_SLOT "pg_19742_sync_18602_7263122209699118815" LOGICAL pgoutput USE_SNAPSHOT
2023-08-03 18:22:50.209 UTC [13884] migrator@foo - pg_19742_sync_18602_7263122209699118815 LOG:  disconnection: session time: 0:00:00.009 user=migrator database=foo host=10.228.88.74 port=47338


and later they include this message about trying to drop them first on the subscriber:

2023-08-03 18:47:39.485 UTC [263502] LOG:  logical replication table synchronization worker for subscription "sub01", table "foo_accounts" has started
2023-08-03 18:47:39.496 UTC [263502] LOG:  could not drop replication slot "pg_19742_sync_17238_7263122209699118815" on publisher: ERROR:  replication slot "pg_19742_sync_17238_7263122209699118815" does not exist
2023-08-03 18:47:39.497 UTC [263502] ERROR:  could not create replication slot "pg_19742_sync_17238_7263122209699118815": ERROR:  all replication slots are in use
        HINT:  Free one or increase max_replication_slots.


and on the publisher side for that:

2023-08-03 18:47:39.487 UTC [3232] [unknown]@[unknown] - [unknown] LOG:  connection received: host=10.228.88.74 port=17390
2023-08-03 18:47:39.494 UTC [3232] migrator@foo - [unknown] LOG:  replication connection authorized: user=migrator application_name=pg_19742_sync_17238_7263122209699118815 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
2023-08-03 18:47:39.495 UTC [3232] migrator@
foo - pg_19742_sync_17238_7263122209699118815 ERROR:  replication slot "pg_19742_sync_17238_7263122209699118815" does not exist
2023-08-03 18:47:39.495 UTC [3232] migrator@
foo - pg_19742_sync_17238_7263122209699118815 STATEMENT:  DROP_REPLICATION_SLOT pg_19742_sync_17238_7263122209699118815 WAIT
2023-08-03 18:47:39.496 UTC [3232] migrator@
foo - pg_19742_sync_17238_7263122209699118815 ERROR:  all replication slots are in use
2023-08-03 18:47:39.496 UTC [3232] migrator@
foo - pg_19742_sync_17238_7263122209699118815 HINT:  Free one or increase max_replication_slots.
2023-08-03 18:47:39.496 UTC [3232] migrator@
foo - pg_19742_sync_17238_7263122209699118815 STATEMENT:  CREATE_REPLICATION_SLOT "pg_19742_sync_17238_7263122209699118815" LOGICAL pgoutput USE_SNAPSHOT
2023-08-03 18:47:39.496 UTC [3232] migrator@
foo - pg_19742_sync_17238_7263122209699118815 LOG:  disconnection: session time: 0:00:00.009 user=migrator datab
ase=
foo host=10.228.88.74 port=17390

and then this from a different pid, maybe some kind of cleanup process? It was doing this for what seemed like all of the sync slots:

2023-08-03 18:47:47.975 UTC [3303] migrator@foo - sub01 ERROR:  replication slot "pg_19742_sync_17238_7263122209699118815" does not exist
2023-08-03 18:47:47.975 UTC [3303] migrator@
foo - sub01 STATEMENT:  DROP_REPLICATION_SLOT pg_19742_sync_17238_7263122209699118815 WAIT

--
Don Seiler
www.seiler.us