Обсуждение: BUG #16172: failure of vacuum file truncation can cause permanent data corruption
BUG #16172: failure of vacuum file truncation can cause permanent data corruption
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 16172 Logged by: TAKATSUKA Haruka Email address: harukat@sraoss.co.jp PostgreSQL version: 12.1 Operating system: Windows/Linux Description: Hello, pgsql hackers, I found that failure of vacuum file truncation can cause permanent data corruption. I am reporting the reproduce steps below. In Windows installation, the truncation sometime fails by permission denied error because of anti-virus software. It has caused just ERROR and people have offen dismissed it. Truncation failure can also make the standby panic with the following messages when replaying Heap2/VISIBLE or Heap2/CLEAN, because truncation wal is emitted even if it doesn't complete actually in the primary. WARNING: page .. of relation base/..../.... does not exist CONTEXT: WAL redo at ..... for ....: cutoff xid ... flags ... PANIC: WAL contains references to invalid pages I think truncation failure is to be handled as more severe level. Any thoughts? with best regards, Haruka Takatsuka / SRA OSS, Inc. Japan reproduce steps (PG12) ====================== $ psql -U postgres -d db1 Pager usage is off. psql (12.1) Type "help" for help. db1=# $ gdb -p {its backend process} (gdb) b FileTruncate Breakpoint 1 at 0x73d320: file fd.c, line 2057. (gdb) c Continuing. db1=# SHOW autovacuum; autovacuum ------------ off (1 row) db1=# CREATE TABLE t1 (id int primary key, v text); CREATE db1=# INSERT INTO t1 SELECT g, md5(g::text) FROM generate_series(1, 10000) as g; INSERT 0 10000 db1=# CHECKPOINT; Program received signal SIGUSR1, User defined signal 1. 0x00000036caae91a3 in __epoll_wait_nocancel () from /lib64/libc.so.6 (gdb) c Continuing. CHECKPOINT db1=# DELETE FROM t1 WHERE id > 50; DELETE 9950 db1=# VACUUM t1; Breakpoint 1, FileTruncate (file=59, offset=8192, wait_event_info=167772175) at fd.c:2057 2057 { (gdb) n 2065 returnCode = FileAccess(file); (gdb) n 2066 if (returnCode < 0) (gdb) p returnCode = -100 $6 = -100 (gdb) c Continuing. ERROR: could not truncate file "base/16384/16645" to 1 blocks: Success db1=# SELECT count(*) FROM t1; count ------- 9930 (1 row) db1=# SELECT * FROM t1 WHERE id > 50 LIMIT 10; id | v -----+---------------------------------- 121 | 4c56ff4ce4aaf9573aa5dff913df997a 122 | a0a080f42e6f13b3a2df133f073095dd 123 | 202cb962ac59075b964b07152d234b70 124 | c8ffe9a587b126f152ed3d89a146b445 125 | 3def184ad8f4755ff269862ea77393dd 126 | 069059b7ef840f0c74a814ec9237b6ec 127 | ec5decca5ed3d6b8079e2e7e7bacc9f2 128 | 76dc611d6ebaafc66cc0879c71b5db5c 129 | d1f491a404d6854880943e5c3cd9ca25 130 | 9b8619251a19057cff70779273e95aa6 (10 rows) db1=# VACUUM FULL FREEZE t1; VACUUM db1=# REINDEX TABLE t1; REINDEX db1=# SELECT count(*) FROM t1; count ------- 9930 (1 row) db1=# SELECT * FROM t1 WHERE id > 50 LIMIT 10; id | v -----+---------------------------------- 121 | 4c56ff4ce4aaf9573aa5dff913df997a 122 | a0a080f42e6f13b3a2df133f073095dd 123 | 202cb962ac59075b964b07152d234b70 124 | c8ffe9a587b126f152ed3d89a146b445 125 | 3def184ad8f4755ff269862ea77393dd 126 | 069059b7ef840f0c74a814ec9237b6ec 127 | ec5decca5ed3d6b8079e2e7e7bacc9f2 128 | 76dc611d6ebaafc66cc0879c71b5db5c 129 | d1f491a404d6854880943e5c3cd9ca25 130 | 9b8619251a19057cff70779273e95aa6 (10 rows) real incident log sample (from Windows PG10.x) ============================================== primary server: --------------- 2019-12-17 19:10:19 JST [14321] ERROR: 42501: could not truncate file "base/444807/447791" to 7 blocks: Permission denied 2019-12-17 19:10:19 JST [14321] CONTEXT: automatic vacuum of table "ddd.ssss.tttt" hot standby server: ------------------- 2019-12-17 19:11:20 JST [12345] WARNING: 01000: page 7 of relation base/444807/447791 does not exist 2019-12-17 19:11:20 JST [12345] CONTEXT: WAL redo at 4E/FC5AD060 for Heap2/VISIBLE: cutoff xid 0 flags 3 2019-12-17 19:11:20 JST [12345] PANIC: XX000: WAL contains references to invalid pages 2019-12-17 19:11:20 JST [12345] CONTEXT: WAL redo at 4E/FC5AD060 for Heap2/VISIBLE: cutoff xid 0 flags 3
Re: BUG #16172: failure of vacuum file truncation can causepermanent data corruption
От
TAKATSUKA Haruka
Дата:
I also tested PostgreSQL with the attached patch avoided this data corruption. The patch just removes DropRelFileNodeBuffers() from smgrtruncate(). On Thu, 19 Dec 2019 07:14:42 +0000 PG Bug reporting form <noreply@postgresql.org> wrote: > The following bug has been logged on the website: > > Bug reference: 16172 > Logged by: TAKATSUKA Haruka > Email address: harukat@sraoss.co.jp > PostgreSQL version: 12.1 > Operating system: Windows/Linux > Description: > > Hello, pgsql hackers, > > I found that failure of vacuum file truncation can cause permanent data > corruption. > I am reporting the reproduce steps below. > > In Windows installation, the truncation sometime fails by permission > denied error because of anti-virus software. It has caused just ERROR > and people have offen dismissed it. > > Truncation failure can also make the standby panic with the following > messages when replaying Heap2/VISIBLE or Heap2/CLEAN, because truncation > wal is emitted even if it doesn't complete actually in the primary. > > WARNING: page .. of relation base/..../.... does not exist > CONTEXT: WAL redo at ..... for ....: cutoff xid ... flags ... > PANIC: WAL contains references to invalid pages > > I think truncation failure is to be handled as more severe level. > Any thoughts? > > with best regards, > Haruka Takatsuka / SRA OSS, Inc. Japan > > > reproduce steps (PG12) > ====================== > > $ psql -U postgres -d db1 > Pager usage is off. > psql (12.1) > Type "help" for help. > > db1=# > > $ gdb -p {its backend process} > > (gdb) b FileTruncate > Breakpoint 1 at 0x73d320: file fd.c, line 2057. > (gdb) c > Continuing. > > db1=# SHOW autovacuum; > autovacuum > ------------ > off > (1 row) > > db1=# CREATE TABLE t1 (id int primary key, v text); > CREATE > > db1=# INSERT INTO t1 SELECT g, md5(g::text) FROM generate_series(1, 10000) > as g; > INSERT 0 10000 > > db1=# CHECKPOINT; > > Program received signal SIGUSR1, User defined signal 1. > 0x00000036caae91a3 in __epoll_wait_nocancel () from /lib64/libc.so.6 > (gdb) c > Continuing. > > CHECKPOINT > > db1=# DELETE FROM t1 WHERE id > 50; > DELETE 9950 > > db1=# VACUUM t1; > > Breakpoint 1, FileTruncate (file=59, offset=8192, > wait_event_info=167772175) > at fd.c:2057 > 2057 { > (gdb) n > 2065 returnCode = FileAccess(file); > (gdb) n > 2066 if (returnCode < 0) > (gdb) p returnCode = -100 > $6 = -100 > (gdb) c > Continuing. > > ERROR: could not truncate file "base/16384/16645" to 1 blocks: Success > > db1=# SELECT count(*) FROM t1; > count > ------- > 9930 > (1 row) > (snip)
Вложения
Re: BUG #16172: failure of vacuum file truncation can causepermanent data corruption
От
TAKATSUKA Haruka
Дата:
I found moving DropRelFileNodeBuffers() from top to end in function smgrtruncate() is a proper modification. It passed the regression test and this reproduction test. with best regards, Haruka Takatsuka / SRA OSS, Inc. Japan On Fri, 20 Dec 2019 10:19:52 +0900 TAKATSUKA Haruka <harukat@sraoss.co.jp> wrote: > I also tested PostgreSQL with the attached patch avoided this data > corruption. The patch just removes DropRelFileNodeBuffers() from > smgrtruncate(). > > > On Thu, 19 Dec 2019 07:14:42 +0000 > PG Bug reporting form <noreply@postgresql.org> wrote: > > > The following bug has been logged on the website: > > > > Bug reference: 16172 > > Logged by: TAKATSUKA Haruka > > Email address: harukat@sraoss.co.jp > > PostgreSQL version: 12.1 > > Operating system: Windows/Linux > > Description: > > > > Hello, pgsql hackers, > > > > I found that failure of vacuum file truncation can cause permanent data > > corruption. > > I am reporting the reproduce steps below. > > > > In Windows installation, the truncation sometime fails by permission > > denied error because of anti-virus software. It has caused just ERROR > > and people have offen dismissed it. > > > > Truncation failure can also make the standby panic with the following > > messages when replaying Heap2/VISIBLE or Heap2/CLEAN, because truncation > > wal is emitted even if it doesn't complete actually in the primary. > > > > WARNING: page .. of relation base/..../.... does not exist > > CONTEXT: WAL redo at ..... for ....: cutoff xid ... flags ... > > PANIC: WAL contains references to invalid pages > > > > I think truncation failure is to be handled as more severe level. > > Any thoughts? > > > > with best regards, > > Haruka Takatsuka / SRA OSS, Inc. Japan > > > > > > reproduce steps (PG12) > > ====================== > > > > $ psql -U postgres -d db1 > > Pager usage is off. > > psql (12.1) > > Type "help" for help. > > > > db1=# > > > > $ gdb -p {its backend process} > > > > (gdb) b FileTruncate > > Breakpoint 1 at 0x73d320: file fd.c, line 2057. > > (gdb) c > > Continuing. > > > > db1=# SHOW autovacuum; > > autovacuum > > ------------ > > off > > (1 row) > > > > db1=# CREATE TABLE t1 (id int primary key, v text); > > CREATE > > > > db1=# INSERT INTO t1 SELECT g, md5(g::text) FROM generate_series(1, 10000) > > as g; > > INSERT 0 10000 > > > > db1=# CHECKPOINT; > > > > Program received signal SIGUSR1, User defined signal 1. > > 0x00000036caae91a3 in __epoll_wait_nocancel () from /lib64/libc.so.6 > > (gdb) c > > Continuing. > > > > CHECKPOINT > > > > db1=# DELETE FROM t1 WHERE id > 50; > > DELETE 9950 > > > > db1=# VACUUM t1; > > > > Breakpoint 1, FileTruncate (file=59, offset=8192, > > wait_event_info=167772175) > > at fd.c:2057 > > 2057 { > > (gdb) n > > 2065 returnCode = FileAccess(file); > > (gdb) n > > 2066 if (returnCode < 0) > > (gdb) p returnCode = -100 > > $6 = -100 > > (gdb) c > > Continuing. > > > > ERROR: could not truncate file "base/16384/16645" to 1 blocks: Success > > > > db1=# SELECT count(*) FROM t1; > > count > > ------- > > 9930 > > (1 row) > > > (snip) ______________________________________________________________________ 高塚 遥 harukat@sraoss.co.jp SRA OSS, Inc. http://www.sraoss.co.jp 〒171-0022 東京都豊島区南池袋2-32-8 TEL: 03-5979-2701 FAX: 03-5979-2702 CellPhone: 080-1292-3396
Вложения
Re: BUG #16172: failure of vacuum file truncation can causepermanent data corruption
От
Kyotaro Horiguchi
Дата:
Hello, At Fri, 20 Dec 2019 11:00:28 +0900, TAKATSUKA Haruka <harukat@sraoss.co.jp> wrote in > > I found moving DropRelFileNodeBuffers() from top to end in function > smgrtruncate() is a proper modification. It passed the regression test > and this reproduction test. I don't recall clealy but I think there was a thread similar to this issue. Assume that checkpoint was running concurrently, the checkpoint can revive the just truncated blocks inadvertently with bogus content. reagards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BUG #16172: failure of vacuum file truncation can causepermanent data corruption
От
TAKATSUKA Haruka
Дата:
Horiguchi-san Thanks for pointing out it. I wasn't looking for past arguments enough. You may mean this links are: https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us https://www.postgresql.org/message-id/15667-8d3fca4eba25174f%40postgresql.org regards, TAKATSUKA Haruka On Fri, 20 Dec 2019 14:22:22 +0900 (JST) Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > Hello, > > At Fri, 20 Dec 2019 11:00:28 +0900, TAKATSUKA Haruka <harukat@sraoss.co.jp> wrote in > > > > I found moving DropRelFileNodeBuffers() from top to end in function > > smgrtruncate() is a proper modification. It passed the regression test > > and this reproduction test. > > I don't recall clealy but I think there was a thread similar to this > issue. Assume that checkpoint was running concurrently, the > checkpoint can revive the just truncated blocks inadvertently with > bogus content. > > reagards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center