Silent data corruption in PostgreSQL 17 - how to detect it proactively?
From: Pawel Kudzia
Subject: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
Date:
Msg-id: CAJYBUS9T=JCVGkb9yXn5GVYC_g-e5mRi4MFoNHKEZV0keuHHwQ@mail.gmail.com
Responses: Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
List: pgsql-general
I've hit silent data corruption in TOAST data, leading to an infinite loop when accessing a bytea column for one very particular row. I did not suffer data loss - the data on the streaming replica was fine and I used it to rebuild the main server. I'm wondering if there is any proactive way of detecting this type of issue, rather than discovering a pile-up of SELECT queries leading to CPU starvation or finding hung backup jobs. Thanks in advance for your suggestions!

I was originally running PostgreSQL 17.2, installed from the project's deb packages, under Debian 12 on amd64. The environment is stable - no recent crashes, ECC memory, server-grade hardware.

I run pg_dumpall every 24h; on a certain day it hung. On re-run it hung again, each time leaving a single PostgreSQL process using 100% of a single CPU core. I've narrowed the problem down to specific rows in a specific table.

Table definition:

CREATE TABLE public.generic_storage (
    id integer NOT NULL,
    content_type smallint NOT NULL,
    model_type smallint NOT NULL,
    content_updated_date timestamp without time zone NOT NULL,
    stored_date timestamp without time zone NOT NULL,
    content_data bytea NOT NULL,
    data_source smallint DEFAULT '-1'::integer NOT NULL
);

ALTER TABLE ONLY public.generic_storage
    ADD CONSTRAINT generic_storage_unique UNIQUE (id, content_type, data_source);

CREATE INDEX idx_generic_storage_content_type_id ON public.generic_storage USING btree (content_type, id);

CREATE INDEX idx_generic_storage_content_updated_date ON public.generic_storage USING btree (content_updated_date);

Query that works fine:

SELECT id, content_type, model_type, content_updated_date, stored_date, data_source
FROM generic_storage WHERE id=79923514 AND content_type=18;

EXPLAIN ANALYZE SELECT id, content_type, model_type, content_updated_date, stored_date, data_source
FROM generic_storage WHERE id=79923514 AND content_type=18;

 Index Scan using idx_generic_storage_content_type_id on generic_storage  (cost=0.57..88.04 rows=30 width=26) (actual time=0.055..0.059 rows=1 loops=1)
   Index Cond: ((content_type = 18) AND (id = 79923514))
 Planning Time: 0.191 ms
 Execution Time: 0.119 ms
(4 rows)

Query that hangs:

SELECT content_data FROM generic_storage WHERE id=79923514 AND content_type=18;

I've upgraded PostgreSQL to the latest 17.6-1.pgdg12+1 and still get the issue.

gdb stack trace dumped for the process "postgres: 17/main: postgres application_cache [local] SELECT" that keeps using 100% of a single CPU core:

#0  0x0000561b4a829fd1 in ?? ()
#1  0x0000561b4a82eb49 in ReleaseAndReadBuffer ()
#2  0x0000561b4a557573 in _bt_relandgetbuf ()
#3  0x0000561b4a55cb9b in _bt_moveright ()
#4  0x0000561b4a55cd69 in _bt_search ()
#5  0x0000561b4a55db83 in _bt_first ()
#6  0x0000561b4a559bbf in btgettuple ()
#7  0x0000561b4a551289 in index_getnext_tid ()
#8  0x0000561b4a5513ed in index_getnext_slot ()
#9  0x0000561b4a550957 in systable_getnext_ordered ()
#10 0x0000561b4a5472bd in heap_fetch_toast_slice ()
#11 0x0000561b4a500bae in ?? ()
#12 0x0000561b4a500fc5 in detoast_attr ()
#13 0x0000561b4a97dbd6 in byteaout ()
#14 0x0000561b4a9b19a1 in OutputFunctionCall ()
#15 0x0000561b4a504f85 in ?? ()
#16 0x0000561b4a6b01ef in standard_ExecutorRun ()
#17 0x00007f0aa982a47d in ?? () from /usr/lib/postgresql/17/lib/pg_stat_statements.so
#18 0x0000561b4a865ce9 in ?? ()
#19 0x0000561b4a867172 in PortalRun ()
#20 0x0000561b4a8634a5 in ?? ()
#21 0x0000561b4a86482d in PostgresMain ()
#22 0x0000561b4a85fa3f in BackendMain ()
#23 0x0000561b4a7cd2aa in postmaster_child_launch ()
#24 0x0000561b4a7d0db9 in ?? ()
#25 0x0000561b4a7d2cc8 in PostmasterMain ()
#26 0x0000561b4a4f4a11 in main ()

PostgreSQL's log does not have any error messages indicating data corruption. This server does not have checksums enabled, but - as I understand it - such checksums cannot be checked online anyway.
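For what it's worth, checksum status can be confirmed from psql, and as far as I understand pg_checksums can only verify a cluster that has been cleanly shut down, which is why an online check is not an option. A sketch - the paths below are just the stock Debian ones, an assumption on my side:

SHOW data_checksums;   -- reports whether data checksums are enabled on the running cluster

# from the shell, only with the cluster stopped:
/usr/lib/postgresql/17/bin/pg_checksums --check -D /var/lib/postgresql/17/main

The trace above also appears to be spinning inside the B-tree search over the table's TOAST index (heap_fetch_toast_slice down to _bt_moveright / _bt_relandgetbuf), so presumably the contrib amcheck module could be pointed at that index. A rough sketch of what I have in mind, assuming amcheck can be installed on this cluster and accepting that, on an already-broken index, the check itself might hit the same loop:

CREATE EXTENSION IF NOT EXISTS amcheck;

-- locate the TOAST table and TOAST index behind public.generic_storage
-- and run a structural check on that index
SELECT c.reltoastrelid::regclass AS toast_table,
       i.indexrelid::regclass    AS toast_index,
       bt_index_check(i.indexrelid::regclass)
FROM pg_class c
JOIN pg_index i ON i.indrelid = c.reltoastrelid
WHERE c.oid = 'public.generic_storage'::regclass;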
So - how can I detect similar corruption proactively? Thank you!

--
regards,
Pawel Kudzia