Re: Synchronizing slots from primary to standby

Поиск
Список
Период
Сортировка
От Amit Kapila
Тема Re: Synchronizing slots from primary to standby
Дата
Msg-id CAA4eK1Jun8SGCoc6JEktxY_+L7GmoJWrdsx-KCEP=GL-SsWggQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Synchronizing slots from primary to standby  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: Synchronizing slots from primary to standby  (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
RE: Synchronizing slots from primary to standby  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
Список pgsql-hackers
On Fri, Feb 16, 2024 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Thanks for noticing this. I have pushed all your debug patches. Let's
> hope if there is a BF failure next time, we can gather enough
> information to know the reason of the same.
>

There is a new BF failure [1] after adding these LOGs and I think I
know what is going wrong. First, let's look at standby LOGs:

2024-02-16 06:18:18.442 UTC [241414][client backend][2/14:0] DEBUG:
segno: 4 of purposed restart_lsn for the synced slot, oldest_segno: 4
available
2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] DEBUG:
xmin required by slots: data 0, catalog 741
2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] LOG:mote
 could not sync slot information as reslot precedes local slot: remote
slot "lsub1_slot": LSN (0/4000168), catalog xmin (739) local slot: LSN
(0/4000168), catalog xmin (741)

So, from the above LOG, it is clear that the remote slot's catalog
xmin (739) precedes the local catalog xmin (741) which makes the sync
on standby to not complete.

Next, let's look at the LOG from the primary during the nearby time:
2024-02-16 06:18:11.354 UTC [238037][autovacuum worker][5/17:0] DEBUG:
 analyzing "pg_catalog.pg_depend"
2024-02-16 06:18:11.360 UTC [238037][autovacuum worker][5/17:0] DEBUG:
 "pg_depend": scanned 13 of 13 pages, containing 1709 live rows and 0
dead rows; 1709 rows in sample, 1709 estimated total rows
...
2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/0:0] DEBUG:
Autovacuum VacuumUpdateCosts(db=1, rel=14050, dobalance=yes,
cost_limit=200, cost_delay=2 active=yes failsafe=no)
2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/19:0] DEBUG:
 analyzing "information_schema.sql_features"
2024-02-16 06:18:11.377 UTC [238037][autovacuum worker][5/19:0] DEBUG:
 "sql_features": scanned 8 of 8 pages, containing 756 live rows and 0
dead rows; 756 rows in sample, 756 estimated total rows

It shows us that autovacuum worker has analyzed catalog table and for
updating its statistics in pg_statistic table, it would have acquired
a new transaction id. Now, after the slot creation, a new transaction
id that has updated the catalog is generated on primary and would have
been replication to standby. Due to this catalog_xmin of primary's
slot would precede standby's catalog_xmin and we see this failure.

As per this theory, we should disable autovacuum on primary to avoid
updates to catalog_xmin values.




[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-02-16%2006%3A12%3A59

--
With Regards,
Amit Kapila.



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Bharath Rupireddy
Дата:
Сообщение: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Следующее
От: Bertrand Drouvot
Дата:
Сообщение: Re: Synchronizing slots from primary to standby