Обсуждение: Logical replication stalling for large tables with heavy write activity - Pg11

Поиск
Список
Период
Сортировка

Logical replication stalling for large tables with heavy write activity - Pg11

От
Mukesh Chhatani
Дата:
Hello Team,

I am working on upgrading Postgres at my current organization using logical replication.

In most of the cases Logical replication works fine, but in below scenario it just stalls for more than 24 hours

Primary DB - Pg10.11
Table Size - 90 GB including 20 GB of indexes

Secondary - Pg11.6

After few hours table is still being copied in the pg_stat_replication and plg_replication_slot I can replication is live currently and table size on the secondary is increasing, 

When I check the table size on the primary using pg_size_pretty(pg_total_relation_size('test')), it shows 90 GB but on the secondary where replication is happening it shows more than 150GB, and still replication is in progress with no errors in the logs.

This problem does not happen with tables which have less write activity (updates/inserts).

Please let me know if this is a known issue or is there any way I can fix this problem using some configuration.

Any insights would be gratefully appreciated.

Thanks & Regards,
Mukesh Chhatani

Re: Logical replication stalling for large tables with heavy write activity - Pg11

От
Amit Kapila
Дата:
On Sun, Aug 16, 2020 at 3:48 AM Mukesh Chhatani
<chhatani.mukesh@gmail.com> wrote:
>
> Hello Team,
>
> I am working on upgrading Postgres at my current organization using logical replication.
>
> In most of the cases Logical replication works fine, but in below scenario it just stalls for more than 24 hours
>
> Primary DB - Pg10.11
> Table Size - 90 GB including 20 GB of indexes
>
> Secondary - Pg11.6
>
> After few hours table is still being copied in the pg_stat_replication and plg_replication_slot I can replication is
livecurrently and table size on the secondary is increasing,
 
>

When you say replication is live, what do you mean by that? Isn't it
expected for a heavy-write table because you are updating the table on
the primary so secondary should also be getting the WAL corresponding
to it?

> When I check the table size on the primary using pg_size_pretty(pg_total_relation_size('test')), it shows 90 GB but
onthe secondary where replication is happening it shows more than 150GB, and still replication is in progress with no
errorsin the logs.
 
>
> This problem does not happen with tables which have less write activity (updates/inserts).
>

One way this could happen is if on the new server autovacuum is not
able to clean the table with heavy write activity. Are settings of
autovacuum on both servers the same?

-- 
With Regards,
Amit Kapila.



Re: Logical replication stalling for large tables with heavy write activity - Pg11

От
Mukesh Chhatani
Дата:
Thanks for checking and responding back, after detailed investigation of the logs found that logical replication was timing out on initial sync of large tables and initial copy of the tables would just restart.

So I changed parameters - max-standby_streaming_delay and max_standby_archive_delay and then the issue no longer happens.

Thanks & Regards,
Mukesh Chhatani

On Mon, Aug 24, 2020 at 9:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Aug 16, 2020 at 3:48 AM Mukesh Chhatani
<chhatani.mukesh@gmail.com> wrote:
>
> Hello Team,
>
> I am working on upgrading Postgres at my current organization using logical replication.
>
> In most of the cases Logical replication works fine, but in below scenario it just stalls for more than 24 hours
>
> Primary DB - Pg10.11
> Table Size - 90 GB including 20 GB of indexes
>
> Secondary - Pg11.6
>
> After few hours table is still being copied in the pg_stat_replication and plg_replication_slot I can replication is live currently and table size on the secondary is increasing,
>

When you say replication is live, what do you mean by that? Isn't it
expected for a heavy-write table because you are updating the table on
the primary so secondary should also be getting the WAL corresponding
to it?

> When I check the table size on the primary using pg_size_pretty(pg_total_relation_size('test')), it shows 90 GB but on the secondary where replication is happening it shows more than 150GB, and still replication is in progress with no errors in the logs.
>
> This problem does not happen with tables which have less write activity (updates/inserts).
>

One way this could happen is if on the new server autovacuum is not
able to clean the table with heavy write activity. Are settings of
autovacuum on both servers the same?

--
With Regards,
Amit Kapila.