Re: new heapcheck contrib module

From Mark Dilger
Subject Re: new heapcheck contrib module
Date
Msg-id 5F743835-3399-419C-8324-2D424237E999@enterprisedb.com
In response to Re: new heapcheck contrib module  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: new heapcheck contrib module  (Peter Geoghegan <pg@bowt.ie>)
Re: new heapcheck contrib module  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Re: new heapcheck contrib module  (Dilip Kumar <dilipbalaut@gmail.com>)
Re: new heapcheck contrib module  (Mark Dilger <mark.dilger@enterprisedb.com>)
List pgsql-hackers
Here is v5 of the patch.  Major changes in this version include:

1) A new module, pg_amcheck, which includes a command line client for checking a database or subset of a database.
Internally it functions by querying the database for a list of tables which are appropriate given the command line
switches, and then calls amcheck's functions to validate each table and/or index.  The options for selecting/excluding
tables and schemas are patterned on pg_dump, on the assumption that interface is already familiar to users.

2) amcheck's btree checking functions have been refactored to be able to operate in two modes: the original mode in
which all errors are reported via ereport, and a new mode for returning errors as rows from a set-returning function.
The new mode is used by a new function verify_btreeam(), analogous to verify_heapam(), both of which are used by the
pg_amcheck command line tool.

3) The regression test which generates corruption within a table uses the pageinspect module to determine the location
of each tuple on disk for corrupting.  This was suggested upthread.
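
To illustrate the arithmetic behind item 3: the following is only a sketch, not the code in the patch's regression test. It assumes the default build settings (8192-byte blocks, 131072 blocks per 1 GB segment file) and takes the block number and the lp_off value that pageinspect's heap_page_items() reports for a line pointer. The function name tuple_file_location is hypothetical.

```python
# Sketch only: where does a heap tuple start within the relation's files?
# Assumptions (not taken from the patch): default BLCKSZ and RELSEG_SIZE.
BLCKSZ = 8192          # bytes per block in a default build
RELSEG_SIZE = 131072   # blocks per segment file (1 GB with 8 kB blocks)


def tuple_file_location(blkno, lp_off, blcksz=BLCKSZ, relseg_size=RELSEG_SIZE):
    """Return (segment_number, byte_offset_within_segment) for a tuple.

    blkno  -- block number within the relation
    lp_off -- offset of the tuple within its page, as shown by pageinspect
    """
    if blkno < 0:
        raise ValueError("block number must be non-negative")
    if not 0 <= lp_off < blcksz:
        raise ValueError("lp_off must fall inside the page")
    # Segment 0 is the base relation file; later segments get a numeric suffix.
    segno, blk_in_segment = divmod(blkno, relseg_size)
    return segno, blk_in_segment * blcksz + lp_off


# Values taken from the pg_amcheck output later in this message.
print(tuple_file_location(0, 7680))
print(tuple_file_location(107, 7312))
```

A test that wants to corrupt a particular tuple could seek to that offset in the (stopped) server's relation file; which bytes to overwrite depends on the tuple header layout and is outside this sketch.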

Testing on the command line shows that the pre-existing btree checking code could use some hardening, as it currently
crashes the backend on certain corruptions.  When I corrupt relation files for tables and indexes in the backend and
then use pg_amcheck to check all objects in the database, I keep getting assertion failures from the btree checking
code.  I think I need to harden this code, but wanted to post an updated patch and solicit opinions before doing so.
Here are some example problems I'm seeing.  Note that the stack trace when calling from the command line tool includes
the new verify_btreeam function, but you can get the same crashes using the old interface via psql:

From psql, first error:

test=# select bt_index_parent_check('corrupted_idx', true, true);
TRAP: FailedAssertion("_bt_check_natts(rel, key->heapkeyspace, page, offnum)", File: "nbtsearch.c", Line: 663)
0   postgres                            0x0000000106872977 ExceptionalCondition + 103
1   postgres                            0x00000001063a33e2 _bt_compare + 1090
2   amcheck.so                          0x0000000106d62921 bt_target_page_check + 6033
3   amcheck.so                          0x0000000106d5fd2f bt_index_check_internal + 2847
4   amcheck.so                          0x0000000106d60433 bt_index_parent_check + 67
5   postgres                            0x00000001064d6762 ExecInterpExpr + 1634
6   postgres                            0x000000010650d071 ExecResult + 321
7   postgres                            0x00000001064ddc3d standard_ExecutorRun + 301
8   postgres                            0x00000001066600c5 PortalRunSelect + 389
9   postgres                            0x000000010665fc7f PortalRun + 527
10  postgres                            0x000000010665ed59 exec_simple_query + 1641
11  postgres                            0x000000010665c99d PostgresMain + 3661
12  postgres                            0x00000001065d6a8a BackendRun + 410
13  postgres                            0x00000001065d61c4 ServerLoop + 3044
14  postgres                            0x00000001065d2fe9 PostmasterMain + 3769
15  postgres                            0x000000010652e3b0 help + 0
16  libdyld.dylib                       0x00007fff6725fcc9 start + 1
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: 2020-05-11 10:11:47.394 PDT [41091] LOG:  server process (PID
41309) was terminated by signal 6: Abort trap: 6



From the command line, second error:

pgtest % pg_amcheck -i test
(relname=corrupted,blkno=0,offnum=16,lp_off=7680,lp_flags=1,lp_len=31,attnum=,chunk=)
tuple xmin = 3289393 is in the future
(relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
tuple xmax = 0 precedes relation relminmxid = 1
(relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
tuple xmin = 12593 is in the future
(relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)

<snip>

(relname=corrupted,blkno=107,offnum=20,lp_off=7392,lp_flags=1,lp_len=34,attnum=,chunk=)
tuple xmin = 306 precedes relation relfrozenxid = 487
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
tuple xmax = 0 precedes relation relminmxid = 1
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
tuple xmin = 305 precedes relation relfrozenxid = 487
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
t_hoff > lp_len (54 > 34)
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
t_hoff not max-aligned (54)
TRAP: FailedAssertion("TransactionIdIsValid(xmax)", File: "heapam_visibility.c", Line: 1319)
0   postgres                            0x0000000105b22977 ExceptionalCondition + 103
1   postgres                            0x0000000105636e86 HeapTupleSatisfiesVacuum + 1158
2   postgres                            0x0000000105634aa1 heapam_index_build_range_scan + 1089
3   amcheck.so                          0x00000001060100f3 bt_index_check_internal + 3811
4   amcheck.so                          0x000000010601057c verify_btreeam + 316
5   postgres                            0x0000000105796266 ExecMakeTableFunctionResult + 422
6   postgres                            0x00000001057a8c35 FunctionNext + 101
7   postgres                            0x00000001057bbf3e ExecNestLoop + 478
8   postgres                            0x000000010578dc3d standard_ExecutorRun + 301
9   postgres                            0x00000001059100c5 PortalRunSelect + 389
10  postgres                            0x000000010590fc7f PortalRun + 527
11  postgres                            0x000000010590ed59 exec_simple_query + 1641
12  postgres                            0x000000010590c99d PostgresMain + 3661
13  postgres                            0x0000000105886a8a BackendRun + 410
14  postgres                            0x00000001058861c4 ServerLoop + 3044
15  postgres                            0x0000000105882fe9 PostmasterMain + 3769
16  postgres                            0x00000001057de3b0 help + 0
17  libdyld.dylib                       0x00007fff6725fcc9 start + 1
pg_amcheck: error: query failed: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
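
For anyone wanting to post-process reports like the one above, here is a rough sketch of a parser. It assumes only the format visible in this sample: a parenthesized line of comma-separated key=value pairs, followed by one line with the message. The format may well change in later versions of the patch, and relation names containing commas or parentheses would break this. The names parse_report and SAMPLE are hypothetical.

```python
import re

# A context line as seen in the sample: "(relname=...,blkno=...,...)".
CONTEXT_RE = re.compile(r"^\((?P<fields>relname=[^)]*)\)$")


def parse_report(text):
    """Pair each context line with the message line that follows it."""
    findings = []
    context = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            context = None  # a blank line ends any pending context
            continue
        match = CONTEXT_RE.match(line)
        if match:
            pairs = match.group("fields").split(",")
            # Empty values (e.g. "attnum=") become empty strings.
            context = dict(pair.split("=", 1) for pair in pairs)
            continue
        if context is not None:
            findings.append({**context, "message": line})
            context = None
        # Lines without a preceding context (stack traces etc.) are ignored.
    return findings


SAMPLE = """\
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
t_hoff > lp_len (54 > 34)
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
t_hoff not max-aligned (54)
TRAP: FailedAssertion("TransactionIdIsValid(xmax)", File: "heapam_visibility.c", Line: 1319)
"""

for finding in parse_report(SAMPLE):
    print(finding["blkno"], finding["offnum"], finding["message"])
```

All values are kept as strings, since fields such as attnum and chunk can be empty.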


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



