Re: new heapcheck contrib module

Поиск
Список
Период
Сортировка
От Dilip Kumar
Тема Re: new heapcheck contrib module
Дата
Msg-id CAFiTN-syyHc3jZoou51v0SR8z0POoNfktqEO6MaGig4YS8mosA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: new heapcheck contrib module  (Mark Dilger <mark.dilger@enterprisedb.com>)
Ответы Re: new heapcheck contrib module  (Mark Dilger <mark.dilger@enterprisedb.com>)
Список pgsql-hackers
On Mon, May 11, 2020 at 10:51 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
> Here is v5 of the patch.  Major changes in this version include:
>
> 1) A new module, pg_amcheck, which includes a command line client for checking a database or subset of a database.
Internallyit functions by querying the database for a list of tables which are appropriate given the command line
switches,and then calls amcheck's functions to validate each table and/or index.  The options for selecting/excluding
tablesand schemas is patterned on pg_dump, on the assumption that interface is already familiar to users. 
>
> 2) amcheck's btree checking functions have been refactored to be able to operate in two modes; the original mode in
whichall errors are reported via ereport, and a new mode for returning errors as rows from a set returning function.
Thenew mode is used by a new function verify_btreeam(), analogous to verify_heapam(), both of which are used by the
pg_amcheckcommand line tool. 
>
> 3) The regression test which generates corruption within a table uses the pageinspect module to determine the
locationof each tuple on disk for corrupting.  This was suggested upthread. 
>
> Testing on the command line shows that the pre-existing btree checking code could use some hardening, as it currently
crashesthe backend on certain corruptions.  When I corrupt relation files for tables and indexes in the backend and
thenuse pg_amcheck to check all objects in the database, I keep getting assertions from the btree checking code.  I
thinkI need to harden this code, but wanted to post an updated patch and solicit opinions before doing so.  Here are
someexample problems I'm seeing.  Note the stack trace when calling from the command line tool includes the new
verify_btreeamfunction, but you can get the same crashes using the old interface via psql: 
>
> From psql, first error:
>
> test=# select bt_index_parent_check('corrupted_idx', true, true);
> TRAP: FailedAssertion("_bt_check_natts(rel, key->heapkeyspace, page, offnum)", File: "nbtsearch.c", Line: 663)
> 0   postgres                            0x0000000106872977 ExceptionalCondition + 103
> 1   postgres                            0x00000001063a33e2 _bt_compare + 1090
> 2   amcheck.so                          0x0000000106d62921 bt_target_page_check + 6033
> 3   amcheck.so                          0x0000000106d5fd2f bt_index_check_internal + 2847
> 4   amcheck.so                          0x0000000106d60433 bt_index_parent_check + 67
> 5   postgres                            0x00000001064d6762 ExecInterpExpr + 1634
> 6   postgres                            0x000000010650d071 ExecResult + 321
> 7   postgres                            0x00000001064ddc3d standard_ExecutorRun + 301
> 8   postgres                            0x00000001066600c5 PortalRunSelect + 389
> 9   postgres                            0x000000010665fc7f PortalRun + 527
> 10  postgres                            0x000000010665ed59 exec_simple_query + 1641
> 11  postgres                            0x000000010665c99d PostgresMain + 3661
> 12  postgres                            0x00000001065d6a8a BackendRun + 410
> 13  postgres                            0x00000001065d61c4 ServerLoop + 3044
> 14  postgres                            0x00000001065d2fe9 PostmasterMain + 3769
> 15  postgres                            0x000000010652e3b0 help + 0
> 16  libdyld.dylib                       0x00007fff6725fcc9 start + 1
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: 2020-05-11 10:11:47.394 PDT [41091] LOG:  server process
(PID41309) was terminated by signal 6: Abort trap: 6 
>
>
>
> From commandline, second error:
>
> pgtest % pg_amcheck -i test
> (relname=corrupted,blkno=0,offnum=16,lp_off=7680,lp_flags=1,lp_len=31,attnum=,chunk=)
> tuple xmin = 3289393 is in the future
> (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
> tuple xmax = 0 precedes relation relminmxid = 1
> (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
> tuple xmin = 12593 is in the future
> (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
>
> <snip>
>
> (relname=corrupted,blkno=107,offnum=20,lp_off=7392,lp_flags=1,lp_len=34,attnum=,chunk=)
> tuple xmin = 306 precedes relation relfrozenxid = 487
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> tuple xmax = 0 precedes relation relminmxid = 1
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> tuple xmin = 305 precedes relation relfrozenxid = 487
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> t_hoff > lp_len (54 > 34)
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> t_hoff not max-aligned (54)
> TRAP: FailedAssertion("TransactionIdIsValid(xmax)", File: "heapam_visibility.c", Line: 1319)
> 0   postgres                            0x0000000105b22977 ExceptionalCondition + 103
> 1   postgres                            0x0000000105636e86 HeapTupleSatisfiesVacuum + 1158
> 2   postgres                            0x0000000105634aa1 heapam_index_build_range_scan + 1089
> 3   amcheck.so                          0x00000001060100f3 bt_index_check_internal + 3811
> 4   amcheck.so                          0x000000010601057c verify_btreeam + 316
> 5   postgres                            0x0000000105796266 ExecMakeTableFunctionResult + 422
> 6   postgres                            0x00000001057a8c35 FunctionNext + 101
> 7   postgres                            0x00000001057bbf3e ExecNestLoop + 478
> 8   postgres                            0x000000010578dc3d standard_ExecutorRun + 301
> 9   postgres                            0x00000001059100c5 PortalRunSelect + 389
> 10  postgres                            0x000000010590fc7f PortalRun + 527
> 11  postgres                            0x000000010590ed59 exec_simple_query + 1641
> 12  postgres                            0x000000010590c99d PostgresMain + 3661
> 13  postgres                            0x0000000105886a8a BackendRun + 410
> 14  postgres                            0x00000001058861c4 ServerLoop + 3044
> 15  postgres                            0x0000000105882fe9 PostmasterMain + 3769
> 16  postgres                            0x00000001057de3b0 help + 0
> 17  libdyld.dylib                       0x00007fff6725fcc9 start + 1
> pg_amcheck: error: query failed: server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.

I have just browsed through the patch and the idea is quite
interesting.  I think we can expand it to check that whether the flags
set in the infomask are sane or not w.r.t other flags and xid status.
Some examples are

- If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED
should not be set in new_infomask2.
- If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we
actually cross verify the transaction status from the CLOG and check
whether is matching the hint bit or not.

While browsing through the code I could not find that we are doing
this kind of check,  ignore if we are already checking this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Justin Pryzby
Дата:
Сообщение: pg_dump, gzwrite, and errno
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Recording test runtimes with the buildfarm