Re: broken tables on hot standby after migration on PostgreSQL 16 (3x times last month)

Поиск
Список
Период
Сортировка
От Pavel Stehule
Тема Re: broken tables on hot standby after migration on PostgreSQL 16 (3x times last month)
Дата
Msg-id CAFj8pRANkt0_dYDRFHmx5un9x+A4LCo9XTmoEzySY5FzxmRq+A@mail.gmail.com
обсуждение исходный текст
Ответ на Re: broken tables on hot standby after migration on PostgreSQL 16 (3x times last month)  (Andres Freund <andres@anarazel.de>)
Ответы Re: broken tables on hot standby after migration on PostgreSQL 16 (3x times last month)
Список pgsql-hackers


pá 17. 5. 2024 v 21:50 odesílatel Andres Freund <andres@anarazel.de> napsal:
Hi,

On 2024-05-17 15:12:31 +0200, Pavel Stehule wrote:
> after migration on PostgreSQL 16 I seen 3x times (about every week) broken
> tables on replica nodes. The query fails with error

Migrating from what version?

I think 14, but it should be verified tomorrow 


You're saying that the data is correctly accessible on primaries, but broken
on standbys? Is there any difference in how the page looks like on the primary
vs standby?

I saved one page from master and standby. Can I send it privately? There are some private data (although not too sensitive)
 


> ERROR:  could not access status of transaction 1442871302
> DETAIL:  Could not open file "pg_xact/0560": No such file or directory
>
> verify_heapam reports
>
> ^[[Aprd=# select * from verify_heapam('account_login_history') where blkno
> = 179036;
>  blkno  | offnum | attnum |                                msg
>
> --------+--------+--------+-------------------------------------------------------------------
>  179036 |     30 |        | xmin 1393743382 precedes oldest valid
> transaction ID 3:1687012112

So that's not just a narrow race...


> master
>
> (2024-05-17 14:36:57) prd=# SELECT * FROM
> page_header(get_raw_page('account_login_history', 179036));
>       lsn      │ checksum │ flags │ lower │ upper │ special │ pagesize │
> version │ prune_xid
> ───────────────┼──────────┼───────┼───────┼───────┼─────────┼──────────┼─────────┼───────────
>  A576/810F4CE0 │        0 │     4 │   296 │   296 │    8192 │     8192 │
>     4 │         0
> (1 row)
>
>
> replica
> prd_aukro=# SELECT * FROM page_header(get_raw_page('account_login_history',
> 179036));
>       lsn      | checksum | flags | lower | upper | special | pagesize |
> version | prune_xid
> ---------------+----------+-------+-------+-------+---------+----------+---------+-----------
>  A56C/63979DA0 |        0 |     0 |   296 |   296 |    8192 |     8192 |
>     4 |         0
> (1 row)

Is the replica behind the primary? Or did we somehow end up with diverging
data? The page LSNs differ by about 40GB...

Is there evidence of failed truncations of the relation in the log? From
autovacuum?

no I did not see these bugs,

Does the data in the readable versions of the tuples on that page actually
look valid? Is it possibly duplicated data?

looks well, I didn't see any strange content


I'm basically wondering whether it's possible that we errored out during
truncation (e.g. due to a file permission issue or such). Due to some
brokenness in RelationTruncate() that can lead to data divergence between
primary and standby and to old tuples re-appearing on either.


Another question: Do you use pg_repack or such?

pg_repack was used 2 months before migration 

 

Greetings,

Andres Freund

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: broken tables on hot standby after migration on PostgreSQL 16 (3x times last month)
Следующее
От: Mark Dilger
Дата:
Сообщение: Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.