Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?

From: Pawel Kudzia
Subject: Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
Date:
Msg-id: CAJYBUS9pUjZbqsndbjdHUNugonfQ4DRr_HEn0QkyTXsds0VLMQ@mail.gmail.com
In reply to: Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?  (Laurenz Albe <laurenz.albe@cybertec.at>)
Responses: Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
List: pgsql-general
On Sun, Sep 14, 2025 at 12:35 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Sun, 2025-09-14 at 10:30 +0200, Pawel Kudzia wrote:
> > I've hit silent data corruption in TOAST data - it led to an
> > apparent infinite loop when accessing a bytea column for one very
> > particular row. I did not suffer data loss - the data on the
> > streaming replica was fine, and I used it to rebuild the primary
> > server.
> >
> > I'm wondering if there's any proactive way of detecting that type
> > of issue, rather than discovering a pile-up of SELECT queries
> > leading to CPU starvation or finding hung backup jobs.
> >
> > Thanks in advance for your suggestions!
> >
> > I was originally running PostgreSQL 17.2, installed from the
> > project's deb packages, under Debian 12 on amd64. The environment is
> > stable - no recent crashes, ECC memory, and server-grade hardware.
> > I run pg_dumpall every 24h; one day it hung. On re-run it hung
> > again, each time leaving a single PostgreSQL process using 100% of a
> > single CPU core.
> >
> > [...]
> >
> > PostgreSQL's log does not have any error messages indicating data corruption.
> >
> > This server does not have data checksums enabled, but - as I
> > understand it - such checksums cannot be checked online anyway. So -
> > how can I detect similar corruption?
>
> A proactive way of detecting TOAST corruption... how about
>
>   pg_dump -f /dev/null yourdatabase
>
> If there is TOAST corruption, that should give you an error.
>
> If your standby does not have the same problem, that increases the
> likelihood that the cause is a hardware problem.  Of course, it could
> still be a software bug.
>
> Yours,
> Laurenz Albe

Thanks for your answer! pg_dump mydatabase ends up in much the same
state as the SELECT statement did: it hangs, leaving 'postgres: 17/main:
postgres application_cache [local] COPY' using 100% of a single CPU
core, with no messages in PostgreSQL's logs.
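In the meantime I'm planning to wrap both the nightly dump and your
pg_dump -f /dev/null check in a timeout, so that a hang shows up as an
explicit failure instead of a silently stuck job. A rough sketch of what
I have in mind - assuming GNU coreutils' timeout, local peer
authentication, and with 'application_cache' standing in for the real
database name:

  #!/bin/sh
  # nightly probe: dump to /dev/null and fail loudly on error or stall
  DB=application_cache              # my database name - adjust as needed
  if ! timeout 6h pg_dump -f /dev/null "$DB"; then
      echo "pg_dump probe for $DB failed or exceeded 6h - possible corruption" >&2
      exit 1
  fi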

A gdb stack trace of the stuck COPY backend:

#0  0x000055cb571ef444 in hash_search_with_hash_value ()
#1  0x000055cb5706217a in BufTableLookup ()
#2  0x000055cb5706715f in StartReadBuffer ()
#3  0x000055cb57068671 in ReadBufferExtended ()
#4  0x000055cb56d91573 in _bt_relandgetbuf ()
#5  0x000055cb56d96b9b in _bt_moveright ()
#6  0x000055cb56d96d69 in _bt_search ()
#7  0x000055cb56d97b83 in _bt_first ()
#8  0x000055cb56d93bbf in btgettuple ()
#9  0x000055cb56d8b289 in index_getnext_tid ()
#10 0x000055cb56d8b3ed in index_getnext_slot ()
#11 0x000055cb56d8a957 in systable_getnext_ordered ()
#12 0x000055cb56d812bd in heap_fetch_toast_slice ()
#13 0x000055cb56d3abae in ?? ()
#14 0x000055cb56d3afc5 in detoast_attr ()
#15 0x000055cb571b7bd6 in byteaout ()
#16 0x000055cb571eb9a1 in OutputFunctionCall ()
#17 0x000055cb56e711c7 in ?? ()
#18 0x000055cb56e71b2f in DoCopyTo ()
#19 0x000055cb56e6c06f in DoCopy ()
#20 0x000055cb570a2726 in standard_ProcessUtility ()
#21 0x00007fd02f890a81 in ?? () from /usr/lib/postgresql/17/lib/pg_stat_statements.so
#22 0x000055cb570a0c01 in ?? ()
#23 0x000055cb570a0d18 in ?? ()
#24 0x000055cb570a1296 in PortalRun ()
#25 0x000055cb5709d4a5 in ?? ()
#26 0x000055cb5709e82d in PostgresMain ()
#27 0x000055cb57099a3f in BackendMain ()
#28 0x000055cb570072aa in postmaster_child_launch ()
#29 0x000055cb5700adb9 in ?? ()
#30 0x000055cb5700ccc8 in PostmasterMain ()
#31 0x000055cb56d2ea11 in main ()

I was curious if there's any other tool that can spot that type of
corruption and state it explicitly, rather than having to guess from
execution times.
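
One candidate I'm looking at is the amcheck extension and its pg_amcheck
command-line wrapper that ship with PostgreSQL's contrib packages. As
far as I understand, they can verify heap relations (including, I
believe, following TOAST pointers) and btree indexes while the server is
online. A sketch of what I intend to try, with the database name again
just a placeholder from my setup:

  # requires the amcheck extension to be available in the target database
  # ('CREATE EXTENSION amcheck;' once, as a superuser)
  pg_amcheck --dbname=application_cache --heapallindexed --progress

If that flags the bad TOAST chunk explicitly, it would be exactly the
kind of signal I was hoping for.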

--
regards,
Pawel Kudzia