Re: WAL accumulating, Logical Replication pg 13

Поиск
Список
Период
Сортировка
От Willy-Bas Loos
Тема Re: WAL accumulating, Logical Replication pg 13
Дата
Msg-id CAHnozTjeBpon39FHGyOZ=1BL=x6yp+5E+pJix+aqFev46YX0Kw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: WAL accumulating, Logical Replication pg 13  (Vijaykumar Jain <vijaykumarjain.github@gmail.com>)
Ответы Re: WAL accumulating, Logical Replication pg 13  (Vijaykumar Jain <vijaykumarjain.github@gmail.com>)
Список pgsql-general
Yeah, indexes could slow things down, thanks. Btw I'm not using logical replication for the upgrade, that's not supported for 9.3.
It was more complicated but that's beside the point.

I could just delete the publication and all that belongs to it and start over. But since I'm trying out logical replication, I would like to be more in control than that. It's there anything that I can dig into to find out why the WAL is accumulating?

Op vr 28 mei 2021 22:20 schreef Vijaykumar Jain <vijaykumarjain.github@gmail.com>:
I am not too sure with 9.3
i tried an upgrade from 9.6 to 11 using logical replication (pg_logical extension)

one thing to note.
logical replication initiates a copy from a snapshot, then changes from then on.

I had a very high insert rate on my source tables (v9.6) and the destination (v11) could not keep up (it had tons of indexes when I copied the schema) and it took around a day as the table had around 12 indexes.

So at the destination(v11), I dropped all but the primary index for each table, started subscription and when it was almost caught up, rebuilt the index on the destination concurrently.
it completed in 4-5 hours without stopping the source.
migration completed in a few mins :)

not sure if this would help, but just FYI.


On Sat, 29 May 2021 at 01:36, Willy-Bas Loos <willybas@gmail.com> wrote:
Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13 on Debian 10. This is now in an Acceptance stage (DTAP). I have encountered a problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal and my disks are getting full. The oldest WAL file is 18 days old.
I use Logical Replication from the new cluster to another new cluster with 1 subscriber and 1 subscription.

pg_stat_subscription tells me all recent timestamps.
and this:
db=# select * from pg_replication_slots;
-[ RECORD 1 ]-------+-------------
slot_name           | my_pub1
plugin              | pgoutput
slot_type           | logical
datoid              | 16401
database            | db
temporary           | f
active              | t
active_pid          | 9480
xmin                |
catalog_xmin        | 269168
restart_lsn         | D4/908BC268
confirmed_flush_lsn | E1/25BF5710
wal_status          | extended
safe_wal_size       |



I've had problems with diskspace on this server, with postgres crashing because of it, then added more diskspace and postgres recovered. This doesn't seem to be a problem now.

The publication has the options publish = 'insert, update, delete, truncate', publish_via_partition_root = false
The subscription has the options connect = true, enabled = true, create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'

The log on the publisher says:
2021-05-25 21:25:18.973 CEST [4584] user@db LOG:  starting logical decoding for slot "my_pub1"
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL:  Streaming transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
2021-05-25 21:25:18.973 CEST [4584] user@db LOG:  logical decoding found consistent point at D4/908BC268
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL:  There are no running transactions.
2021-05-25 21:29:49.456 CEST [4614] user@db ERROR:  replication slot "my_pub1" is active for PID 4584
2021-05-25 21:29:54.474 CEST [4615] user@db ERROR:  replication slot "my_pub1" is active for PID 4584

And on the subscriber:
2021-05-28 21:23:46.702 CEST [40039] LOG:  logical replication apply worker for subscription "my_pub1" has started
2021-05-28 21:23:46.712 CEST [40039] ERROR:  could not start WAL streaming: ERROR:  replication slot "my_pub1" is active for PID 730
2021-05-28 21:23:46.714 CEST [19794] LOG:  background worker "logical replication worker" (PID 40039) exited with exit code 1

The postgres settings on the publisher are:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 50GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 1GB
min_wal_size = 80MB
#archive_mode = off
max_wal_senders = 10 # max number of walsender processes
wal_sender_timeout = 1min # in milliseconds; 0 disables
max_replication_slots = 7 # max number of replication slots

On postgres settings on the subscriber:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 25GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 3GB
min_wal_size = 80MB
#archive_mode = off
wal_receiver_timeout = 1min # time that receiver waits for
max_logical_replication_workers = 10 # taken from max_worker_processes
max_sync_workers_per_subscription = 5 # taken from max_logical_replication_workers

I've tried increasing wal_sender_timeout and wal_receiver_timeout to 10 minutes each, but this had no positive effect.

Some advice would be helpful
--
Willy-Bas Loos


--
Thanks,
Vijay
Mumbai, India

В списке pgsql-general по дате отправления:

Предыдущее
От: Lionel Bouton
Дата:
Сообщение: Re: AWS forcing PG upgrade from v9.6 a disaster
Следующее
От: Laura Smith
Дата:
Сообщение: Overriding natural order of query results for a subset