Обсуждение: BUG #17524: Increase in WAL size due to logical replication with publication contain a table with low activity.

Поиск
Список
Период
Сортировка
The following bug has been logged on the website:

Bug reference:      17524
Logged by:          Raman Kumar
Email address:      raman.kumar@r-indventures.com
PostgreSQL version: 13.6
Operating system:   Ubuntu
Description:

We are using logical replication to source the event from postgres. and
publication includes a single table with (Insert, update, delete and
truncate ).
Everything works great, Until we run into a strange issue i.e. Increase is
the disk usage, abnormally.
On looking further we have found that some of the disk space occupy by the
wal.
Using this query
```
Select slot_name, pg_current_wal_lsn(), confirmed_flush_lsn,
(pg_current_wal_lsn() - confirmed_flush_lsn) as lag from
pg_replication_slots;
```
```
"slot_name"   "pg_current_wal_lsn"  "confirmed_flush_lsn" "lsn_distance"
"poc_kafka_connect"   "109C/AC001020"   "109C/94909990"   393180816
```

```
select usename, client_addr, pg_current_wal_lsn(), flush_lsn,
(pg_current_wal_lsn() - flush_lsn) as lag from pg_stat_replication;
```

```
"usename" "client_addr" "pg_current_wal_lsn"  "flush_lsn"   "lag"
"postgres"    "x.x.x.x"   "109C/AC0011A0"   "109C/94909990"   393181200
```

We have notice the replication lag. But as the publication includes on the
single table, the last changes are already replicated/published.
So Technically there is no replication lag as per publication.

It seems like increase in wal size links which this replication delay. As
due to logical replication enabled, postgres keeps the wal files which are
not yet to confirmed to be replicated.

To support this hypothesis when making changes in the table which includes
in the publication, wal size drop back to normal.

```
"usename" "client_addr" "pg_current_wal_lsn"  "flush_lsn"   "lag"
"postgres"    "172.31.28.199"   "109C/B80065E0"   "109C/B8002118"   17608
```

I have raised the question on the postgres slack group as well. 
Please find the conversion thread here
https://postgresteam.slack.com/archives/C0FS3UTAP/p1655704259823169


On Mon, Jun 20, 2022 at 7:07 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      17524
> Logged by:          Raman Kumar
> Email address:      raman.kumar@r-indventures.com
> PostgreSQL version: 13.6
> Operating system:   Ubuntu
> Description:
>
> We are using logical replication to source the event from postgres. and
> publication includes a single table with (Insert, update, delete and
> truncate ).
> Everything works great, Until we run into a strange issue i.e. Increase is
> the disk usage, abnormally.
> On looking further we have found that some of the disk space occupy by the
> wal.
> Using this query
> ```
> Select slot_name, pg_current_wal_lsn(), confirmed_flush_lsn,
> (pg_current_wal_lsn() - confirmed_flush_lsn) as lag from
> pg_replication_slots;
> ```
> ```
> "slot_name"   "pg_current_wal_lsn"  "confirmed_flush_lsn" "lsn_distance"
> "poc_kafka_connect"   "109C/AC001020"   "109C/94909990"   393180816
> ```
>
> ```
> select usename, client_addr, pg_current_wal_lsn(), flush_lsn,
> (pg_current_wal_lsn() - flush_lsn) as lag from pg_stat_replication;
> ```
>
> ```
> "usename" "client_addr" "pg_current_wal_lsn"  "flush_lsn"   "lag"
> "postgres"    "x.x.x.x"   "109C/AC0011A0"   "109C/94909990"   393181200
> ```
>
> We have notice the replication lag. But as the publication includes on the
> single table, the last changes are already replicated/published.
> So Technically there is no replication lag as per publication.
>

The lag is computed based on the difference between the WAL location
written in the publisher and the WAL location confirmed by the
subscriber as replayed. Normally for non-published data, the latest
WAL location is confirmed by the subscriber either because we send an
empty transaction (pair of BEGIN ... COMMIT commands) or via
keep-alive messages. Now, it is not clear why none of those is working
in your case? Are there any large transactions happening on the
non-published tables? You can once check data/pg_replslot directory to
confirm if there are any spill files present there?

-- 
With Regards,
Amit Kapila.



We are using a debezium postgres kafka connector, it does handle the empty transaction as far as i know.

About the large transactions, no there aren't any as most of the time the disk usage spikes when the activity is really low.

I am using AWS RDS to host the postgres instance, so I don't really have an idea on how to check the data/pg_replslot directory.


In order to mitigate the issue temporarily, We have added a dummy table in the publication on which we are inserting random. This keeps the replication lag under control and avoid filling up the disk.

On Tue, Jun 21, 2022 at 7:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 20, 2022 at 7:07 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      17524
> Logged by:          Raman Kumar
> Email address:      raman.kumar@r-indventures.com
> PostgreSQL version: 13.6
> Operating system:   Ubuntu
> Description:
>
> We are using logical replication to source the event from postgres. and
> publication includes a single table with (Insert, update, delete and
> truncate ).
> Everything works great, Until we run into a strange issue i.e. Increase is
> the disk usage, abnormally.
> On looking further we have found that some of the disk space occupy by the
> wal.
> Using this query
> ```
> Select slot_name, pg_current_wal_lsn(), confirmed_flush_lsn,
> (pg_current_wal_lsn() - confirmed_flush_lsn) as lag from
> pg_replication_slots;
> ```
> ```
> "slot_name"   "pg_current_wal_lsn"  "confirmed_flush_lsn" "lsn_distance"
> "poc_kafka_connect"   "109C/AC001020"   "109C/94909990"   393180816
> ```
>
> ```
> select usename, client_addr, pg_current_wal_lsn(), flush_lsn,
> (pg_current_wal_lsn() - flush_lsn) as lag from pg_stat_replication;
> ```
>
> ```
> "usename" "client_addr" "pg_current_wal_lsn"  "flush_lsn"   "lag"
> "postgres"    "x.x.x.x"   "109C/AC0011A0"   "109C/94909990"   393181200
> ```
>
> We have notice the replication lag. But as the publication includes on the
> single table, the last changes are already replicated/published.
> So Technically there is no replication lag as per publication.
>

The lag is computed based on the difference between the WAL location
written in the publisher and the WAL location confirmed by the
subscriber as replayed. Normally for non-published data, the latest
WAL location is confirmed by the subscriber either because we send an
empty transaction (pair of BEGIN ... COMMIT commands) or via
keep-alive messages. Now, it is not clear why none of those is working
in your case? Are there any large transactions happening on the
non-published tables? You can once check data/pg_replslot directory to
confirm if there are any spill files present there?

--
With Regards,
Amit Kapila.