Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
From | Pawel Kudzia
Subject | Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
Date |
Msg-id | CAJYBUS9pUjZbqsndbjdHUNugonfQ4DRr_HEn0QkyTXsds0VLMQ@mail.gmail.com
In reply to | Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively? (Laurenz Albe <laurenz.albe@cybertec.at>)
Responses | Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
List | pgsql-general
On Sun, Sep 14, 2025 at 12:35 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Sun, 2025-09-14 at 10:30 +0200, Pawel Kudzia wrote:
> > I've hit silent data corruption in TOAST data - leading to an
> > infinite loop when accessing the bytea column of one very particular
> > row. I did not suffer data loss - the data on the streaming replica
> > was fine, and I used it to rebuild the main server.
> >
> > I'm wondering if there's any proactive way of detecting that type of
> > issue rather than discovering a pile-up of SELECT queries leading to
> > CPU starvation or finding hung backup jobs.
> >
> > Thanks in advance for your suggestions!
> >
> > I was originally running PostgreSQL 17.2 installed from the project's
> > deb packages, under Debian 12 on amd64. The environment is stable -
> > no recent crashes, ECC memory, server-grade hardware. I run
> > pg_dumpall every 24h; on a certain day it hung. On re-run it hung
> > again, each time leaving a single PostgreSQL process using 100% of a
> > single CPU core.
> >
> > [...]
> >
> > PostgreSQL's log does not have any error messages indicating data
> > corruption.
> >
> > This server does not have checksums enabled, but - as I understand -
> > such checksums cannot be checked online anyway. So - how can I detect
> > similar corruption?
>
> A proactive way of detecting TOAST corruption... how about
>
> pg_dump -f /dev/null yourdatabase
>
> If there is TOAST corruption, that should give you an error.
>
> If your standby does not have the same problem, that increases the
> likelihood that the cause is a hardware problem. Of course, it could
> still be a software bug.
>
> Yours,
> Laurenz Albe

Thanks for your answer!

pg_dump mydatabase ends up in ~the same state as the SELECT statement did:
it hangs, leaving 'postgres: 17/main: postgres application_cache [local] COPY'
using 100% of a single CPU core, with no messages in PG's logs.

gdb stack trace for that process:

#0  0x000055cb571ef444 in hash_search_with_hash_value ()
#1  0x000055cb5706217a in BufTableLookup ()
#2  0x000055cb5706715f in StartReadBuffer ()
#3  0x000055cb57068671 in ReadBufferExtended ()
#4  0x000055cb56d91573 in _bt_relandgetbuf ()
#5  0x000055cb56d96b9b in _bt_moveright ()
#6  0x000055cb56d96d69 in _bt_search ()
#7  0x000055cb56d97b83 in _bt_first ()
#8  0x000055cb56d93bbf in btgettuple ()
#9  0x000055cb56d8b289 in index_getnext_tid ()
#10 0x000055cb56d8b3ed in index_getnext_slot ()
#11 0x000055cb56d8a957 in systable_getnext_ordered ()
#12 0x000055cb56d812bd in heap_fetch_toast_slice ()
#13 0x000055cb56d3abae in ?? ()
#14 0x000055cb56d3afc5 in detoast_attr ()
#15 0x000055cb571b7bd6 in byteaout ()
#16 0x000055cb571eb9a1 in OutputFunctionCall ()
#17 0x000055cb56e711c7 in ?? ()
#18 0x000055cb56e71b2f in DoCopyTo ()
#19 0x000055cb56e6c06f in DoCopy ()
#20 0x000055cb570a2726 in standard_ProcessUtility ()
#21 0x00007fd02f890a81 in ?? () from /usr/lib/postgresql/17/lib/pg_stat_statements.so
#22 0x000055cb570a0c01 in ?? ()
#23 0x000055cb570a0d18 in ?? ()
#24 0x000055cb570a1296 in PortalRun ()
#25 0x000055cb5709d4a5 in ?? ()
#26 0x000055cb5709e82d in PostgresMain ()
#27 0x000055cb57099a3f in BackendMain ()
#28 0x000055cb570072aa in postmaster_child_launch ()
#29 0x000055cb5700adb9 in ?? ()
#30 0x000055cb5700ccc8 in PostmasterMain ()
#31 0x000055cb56d2ea11 in main ()

I was curious whether there's any other tool that can spot that type of
corruption and state it explicitly, rather than having to guess at it by
monitoring execution time.

--
regards,
Pawel Kudzia
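Building on Laurenz's pg_dump suggestion, below is a minimal sketch of a cron-friendly probe. It assumes GNU coreutils' timeout, a locally reachable database named application_cache, and a 6-hour limit; all three are placeholders to adjust, not values taken from this thread. An error from pg_dump signals corruption directly, and the timeout turns a detoast hang like the one above into a non-zero exit instead of a silently stuck job.

    #!/bin/sh
    # Hedged sketch of a proactive TOAST-corruption probe.
    # A throw-away pg_dump catches corruption that raises an error;
    # the hard timeout flags corruption that only shows up as a hang.
    # DB name and time limit are assumptions - adjust to the environment.
    DB="application_cache"
    LIMIT="6h"

    if ! timeout "$LIMIT" pg_dump -f /dev/null "$DB"; then
        echo "pg_dump probe on $DB failed or exceeded $LIMIT - possible TOAST corruption" >&2
        exit 1
    fi

Separately, and only as a hedged pointer not verified against this particular case, the amcheck extension and the pg_amcheck wrapper (PostgreSQL 14 and later) can report structural corruption in heap tables and btree indexes explicitly, which is closer to a tool that states the problem outright rather than one whose hang has to be inferred from execution time.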