Обсуждение: BUG #15745: WAL References Invalid Pages...that eventually resolves

Поиск
Список
Период
Сортировка

BUG #15745: WAL References Invalid Pages...that eventually resolves

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      15745
Logged by:          Daniel Farina
Email address:      daniel@citusdata.com
PostgreSQL version: 11.1
Operating system:   Amazon Linux 2018.03
Description:

We have here an interesting case of a panic during recovery from backup with
"WAL References invalid pages." Normally, this is viewed as an issue of
corruption.

But, for serendipitous reasons, I let this one run for a while. As it turns
out, with each crash, it would make *slightly* more progress than the time
before....and then eventually, it suffered no more faults and caught up
normally. Included is a log that shows how sparse these faults were,
relative to all the traffic going on....: roughly two per segment on this
workload, with large gaps between problematic segments, and not necessarily
repetition in a problematic relation or filenode.

The fact the standby eventually came up made me suspicious, so I ran amcheck
with a heap re-check, and, no tuples were in violation.

Included is a log, which shows how the system recovered over and over,
making slight progress each time. This is the entire inventory after such
crashes: after these, the system passed amcheck and appears to work
normally.

postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [130-1]
WARNING:  page 162136064 of relation base/16385/21372 does not exist
postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [131-1]
CONTEXT:  WAL redo at 9AA/F94F2700 for Btree/DELETE: 2 items
postgresql-Mon.log:2019-04-08 00:08:22.619 UTC [3323][1/0] : [132-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [133-1]
CONTEXT:  WAL redo at 9AA/F94F2700 for Btree/DELETE: 2 items
--
postgresql-Mon.log-2019-04-08 00:17:31.686 UTC [3277][1/0] : [106-1] LOG:
restored log file "00000006000009AA000000E2" from archive
postgresql-Mon.log-2019-04-08 00:17:31.741 UTC [3277][1/0] : [107-1]
WARNING:  page 162136064 of relation base/16385/22606 does not exist
postgresql-Mon.log-2019-04-08 00:17:31.741 UTC [3277][1/0] : [108-1]
CONTEXT:  WAL redo at 9AA/E269B0F0 for Btree/DELETE: 1 items
postgresql-Mon.log:2019-04-08 00:17:31.742 UTC [3277][1/0] : [109-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 00:17:31.742 UTC [3277][1/0] : [110-1]
CONTEXT:  WAL redo at 9AA/E269B0F0 for Btree/DELETE: 1 items
postgresql-Mon.log-2019-04-08 00:17:31.815 UTC [3274][] : [11-1] LOG:
startup process (PID 3277) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 00:17:31.815 UTC [3274][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Mon.log-2019-04-08 00:26:17.865 UTC [3288][1/0] : [149-1]
WARNING:  page 1936013626 of relation base/16385/22536 does not exist
postgresql-Mon.log-2019-04-08 00:26:17.865 UTC [3288][1/0] : [150-1]
CONTEXT:  WAL redo at 9AB/C21CA50 for Btree/DELETE: 35 items
postgresql-Mon.log:2019-04-08 00:26:17.865 UTC [3288][1/0] : [151-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 00:26:17.865 UTC [3288][1/0] : [152-1]
CONTEXT:  WAL redo at 9AB/C21CA50 for Btree/DELETE: 35 items
postgresql-Mon.log-2019-04-08 00:26:17.943 UTC [3285][] : [11-1] LOG:
startup process (PID 3288) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 00:26:17.943 UTC [3285][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Mon.log-2019-04-08 00:36:01.466 UTC [3330][1/0] : [118-1] LOG:
restored log file "00000006000009AA000000EE" from archive
postgresql-Mon.log-2019-04-08 00:36:01.497 UTC [3330][1/0] : [119-1]
WARNING:  page 1432945756 of relation base/16385/21365 does not exist
postgresql-Mon.log-2019-04-08 00:36:01.497 UTC [3330][1/0] : [120-1]
CONTEXT:  WAL redo at 9AA/EE183EC0 for Btree/DELETE: 43 items
postgresql-Mon.log:2019-04-08 00:36:01.497 UTC [3330][1/0] : [121-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 00:36:01.497 UTC [3330][1/0] : [122-1]
CONTEXT:  WAL redo at 9AA/EE183EC0 for Btree/DELETE: 43 items
postgresql-Mon.log-2019-04-08 00:36:01.536 UTC [5403][] : [1-1]
[app=[unknown]] LOG:  connection received: host=172.16.101.249 port=48250
postgresql-Mon.log-2019-04-08 00:36:01.574 UTC [3327][] : [11-1] LOG:
startup process (PID 3330) was terminated by signal 6: Aborted
--
postgresql-Mon.log-2019-04-08 00:43:37.086 UTC [3338][1/0] : [121-1] LOG:
restored log file "00000006000009AA000000F1" from archive
postgresql-Mon.log-2019-04-08 00:43:37.134 UTC [3338][1/0] : [122-1]
WARNING:  page 825766448 of relation base/16385/21358 does not exist
postgresql-Mon.log-2019-04-08 00:43:37.134 UTC [3338][1/0] : [123-1]
CONTEXT:  WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items
postgresql-Mon.log:2019-04-08 00:43:37.134 UTC [3338][1/0] : [124-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 00:43:37.134 UTC [3338][1/0] : [125-1]
CONTEXT:  WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items
postgresql-Mon.log-2019-04-08 00:43:37.211 UTC [3335][] : [11-1] LOG:
startup process (PID 3338) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 00:43:37.211 UTC [3335][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Mon.log-2019-04-08 00:52:30.100 UTC [3249][1/0] : [143-1]
WARNING:  page 162201600 of relation base/16385/22456 does not exist
postgresql-Mon.log-2019-04-08 00:52:30.100 UTC [3249][1/0] : [144-1]
CONTEXT:  WAL redo at 9AB/6F53290 for Btree/DELETE: 3 items
postgresql-Mon.log:2019-04-08 00:52:30.100 UTC [3249][1/0] : [145-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 00:52:30.100 UTC [3249][1/0] : [146-1]
CONTEXT:  WAL redo at 9AB/6F53290 for Btree/DELETE: 3 items
postgresql-Mon.log-2019-04-08 00:52:30.173 UTC [3246][] : [11-1] LOG:
startup process (PID 3249) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 00:52:30.173 UTC [3246][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Mon.log-2019-04-08 01:01:56.487 UTC [3315][1/0] : [125-1] LOG:
restored log file "00000006000009AA000000F5" from archive
postgresql-Mon.log-2019-04-08 01:01:56.545 UTC [3315][1/0] : [126-1]
WARNING:  page 162136064 of relation base/16385/22456 does not exist
postgresql-Mon.log-2019-04-08 01:01:56.545 UTC [3315][1/0] : [127-1]
CONTEXT:  WAL redo at 9AA/F57AF340 for Btree/DELETE: 1 items
postgresql-Mon.log:2019-04-08 01:01:56.545 UTC [3315][1/0] : [128-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 01:01:56.545 UTC [3315][1/0] : [129-1]
CONTEXT:  WAL redo at 9AA/F57AF340 for Btree/DELETE: 1 items
--
postgresql-Mon.log-2019-04-08 01:10:55.964 UTC [3274][1/0] : [98-1] LOG:
restored log file "00000006000009AA000000DA" from archive
postgresql-Mon.log-2019-04-08 01:10:56.026 UTC [3274][1/0] : [99-1] WARNING:
 page 162201600 of relation base/16385/21379 does not exist
postgresql-Mon.log-2019-04-08 01:10:56.026 UTC [3274][1/0] : [100-1]
CONTEXT:  WAL redo at 9AA/DA735378 for Btree/DELETE: 1 items
postgresql-Mon.log:2019-04-08 01:10:56.026 UTC [3274][1/0] : [101-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 01:10:56.026 UTC [3274][1/0] : [102-1]
CONTEXT:  WAL redo at 9AA/DA735378 for Btree/DELETE: 1 items
postgresql-Mon.log-2019-04-08 01:10:56.099 UTC [3271][] : [11-1] LOG:
startup process (PID 3274) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 01:10:56.099 UTC [3271][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Mon.log-2019-04-08 01:21:03.013 UTC [3326][1/0] : [266-1] LOG:
restored log file "00000006000009AB00000082" from archive
postgresql-Mon.log-2019-04-08 01:21:03.053 UTC [3326][1/0] : [267-1]
WARNING:  page 1886284143 of relation base/16385/21291 does not exist
postgresql-Mon.log-2019-04-08 01:21:03.053 UTC [3326][1/0] : [268-1]
CONTEXT:  WAL redo at 9AB/823B9360 for Btree/DELETE: 12 items
postgresql-Mon.log:2019-04-08 01:21:03.053 UTC [3326][1/0] : [269-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 01:21:03.053 UTC [3326][1/0] : [270-1]
CONTEXT:  WAL redo at 9AB/823B9360 for Btree/DELETE: 12 items
postgresql-Mon.log-2019-04-08 01:21:03.134 UTC [3323][] : [11-1] LOG:
startup process (PID 3326) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 01:21:03.134 UTC [3323][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Mon.log-2019-04-08 01:30:46.438 UTC [3333][1/0] : [95-1] WARNING:
 page 1934703203 of relation base/16385/21306 does not exist
postgresql-Mon.log-2019-04-08 01:30:46.438 UTC [3333][1/0] : [96-1] CONTEXT:
 WAL redo at 9AA/D6404470 for Btree/DELETE: 9 items
postgresql-Mon.log:2019-04-08 01:30:46.438 UTC [3333][1/0] : [97-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 01:30:46.438 UTC [3333][1/0] : [98-1] CONTEXT:
 WAL redo at 9AA/D6404470 for Btree/DELETE: 9 items
postgresql-Mon.log-2019-04-08 01:30:46.519 UTC [3330][] : [11-1] LOG:
startup process (PID 3333) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 01:30:46.519 UTC [3330][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 22:43:15.774 UTC [17013][1/0] : [121-1] LOG:
restored log file "00000006000009AA000000F1" from archive
postgresql-Sun.log-2019-04-07 22:43:15.824 UTC [17013][1/0] : [122-1]
WARNING:  page 825766448 of relation base/16385/21358 does not exist
postgresql-Sun.log-2019-04-07 22:43:15.824 UTC [17013][1/0] : [123-1]
CONTEXT:  WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items
postgresql-Sun.log:2019-04-07 22:43:15.824 UTC [17013][1/0] : [124-1] PANIC:
 WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 22:43:15.824 UTC [17013][1/0] : [125-1]
CONTEXT:  WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items
postgresql-Sun.log-2019-04-07 22:43:15.904 UTC [17010][] : [11-1] LOG:
startup process (PID 17013) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 22:43:15.904 UTC [17010][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 22:44:15.699 UTC [19068][1/0] : [144-1] LOG:
restored log file "00000006000009AB00000008" from archive
postgresql-Sun.log-2019-04-07 22:44:15.737 UTC [19068][1/0] : [145-1]
WARNING:  page 162201600 of relation base/16385/21365 does not exist
postgresql-Sun.log-2019-04-07 22:44:15.737 UTC [19068][1/0] : [146-1]
CONTEXT:  WAL redo at 9AB/854FF48 for Btree/DELETE: 2 items
postgresql-Sun.log:2019-04-07 22:44:15.737 UTC [19068][1/0] : [147-1] PANIC:
 WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 22:44:15.737 UTC [19068][1/0] : [148-1]
CONTEXT:  WAL redo at 9AB/854FF48 for Btree/DELETE: 2 items
postgresql-Sun.log-2019-04-07 22:44:15.827 UTC [19065][] : [11-1] LOG:
startup process (PID 19068) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 22:44:15.827 UTC [19065][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 22:51:58.298 UTC [3306][1/0] : [121-1] LOG:
restored log file "00000006000009AA000000F1" from archive
postgresql-Sun.log-2019-04-07 22:51:58.342 UTC [3306][1/0] : [122-1]
WARNING:  page 825766448 of relation base/16385/21358 does not exist
postgresql-Sun.log-2019-04-07 22:51:58.342 UTC [3306][1/0] : [123-1]
CONTEXT:  WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items
postgresql-Sun.log:2019-04-07 22:51:58.342 UTC [3306][1/0] : [124-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 22:51:58.342 UTC [3306][1/0] : [125-1]
CONTEXT:  WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items
postgresql-Sun.log-2019-04-07 22:51:58.420 UTC [3303][] : [11-1] LOG:
startup process (PID 3306) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 22:51:58.421 UTC [3303][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 23:02:20.964 UTC [5948][] : [1-1]
[app=[unknown]] LOG:  connection received: host=3.88.80.230 port=57984
postgresql-Sun.log-2019-04-07 23:02:21.041 UTC [3298][1/0] : [155-1]
WARNING:  page 825439802 of relation base/16385/21231 does not exist
postgresql-Sun.log-2019-04-07 23:02:21.041 UTC [3298][1/0] : [156-1]
CONTEXT:  WAL redo at 9AB/12A04DA0 for Btree/DELETE: 14 items
postgresql-Sun.log:2019-04-07 23:02:21.041 UTC [3298][1/0] : [157-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:02:21.041 UTC [3298][1/0] : [158-1]
CONTEXT:  WAL redo at 9AB/12A04DA0 for Btree/DELETE: 14 items
postgresql-Sun.log-2019-04-07 23:02:21.117 UTC [3295][] : [11-1] LOG:
startup process (PID 3298) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 23:02:21.117 UTC [3295][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 23:12:15.077 UTC [3305][1/0] : [112-1]
WARNING:  page 162136064 of relation base/16385/21351 does not exist
postgresql-Sun.log-2019-04-07 23:12:15.077 UTC [3305][1/0] : [113-1]
CONTEXT:  WAL redo at 9AA/E7AF0838 for Btree/DELETE: 4 items
postgresql-Sun.log:2019-04-07 23:12:15.077 UTC [3305][1/0] : [114-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:12:15.077 UTC [3305][1/0] : [115-1]
CONTEXT:  WAL redo at 9AA/E7AF0838 for Btree/DELETE: 4 items
postgresql-Sun.log-2019-04-07 23:12:15.133 UTC [5203][] : [1-1]
[app=[unknown]] LOG:  connection received: host=172.16.101.249 port=60556
postgresql-Sun.log-2019-04-07 23:12:15.149 UTC [3302][] : [11-1] LOG:
startup process (PID 3305) was terminated by signal 6: Aborted
--
postgresql-Sun.log-2019-04-07 23:21:02.554 UTC [5377][] : [1-1]
[app=[unknown]] LOG:  connection received: host=172.16.101.249 port=49380
postgresql-Sun.log-2019-04-07 23:21:02.588 UTC [3258][1/0] : [130-1]
WARNING:  page 162136064 of relation base/16385/21372 does not exist
postgresql-Sun.log-2019-04-07 23:21:02.588 UTC [3258][1/0] : [131-1]
CONTEXT:  WAL redo at 9AA/F94F2700 for Btree/DELETE: 2 items
postgresql-Sun.log:2019-04-07 23:21:02.588 UTC [3258][1/0] : [132-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:21:02.588 UTC [3258][1/0] : [133-1]
CONTEXT:  WAL redo at 9AA/F94F2700 for Btree/DELETE: 2 items
postgresql-Sun.log-2019-04-07 23:21:02.663 UTC [3255][] : [11-1] LOG:
startup process (PID 3258) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 23:21:02.663 UTC [3255][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 23:30:34.035 UTC [3272][1/0] : [135-1]
WARNING:  page 162136064 of relation base/16385/21365 does not exist
postgresql-Sun.log-2019-04-07 23:30:34.035 UTC [3272][1/0] : [136-1]
CONTEXT:  WAL redo at 9AA/FE2A8BE8 for Btree/DELETE: 2 items
postgresql-Sun.log:2019-04-07 23:30:34.035 UTC [3272][1/0] : [137-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:30:34.035 UTC [3272][1/0] : [138-1]
CONTEXT:  WAL redo at 9AA/FE2A8BE8 for Btree/DELETE: 2 items
postgresql-Sun.log-2019-04-07 23:30:34.111 UTC [3269][] : [11-1] LOG:
startup process (PID 3272) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 23:30:34.111 UTC [3269][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 23:40:31.513 UTC [3323][1/0] : [187-1]
WARNING:  page 2710832221 of relation base/16385/21393 does not exist
postgresql-Sun.log-2019-04-07 23:40:31.513 UTC [3323][1/0] : [188-1]
CONTEXT:  WAL redo at 9AB/32811B38 for Btree/DELETE: 31 items
postgresql-Sun.log:2019-04-07 23:40:31.513 UTC [3323][1/0] : [189-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:40:31.513 UTC [3323][1/0] : [190-1]
CONTEXT:  WAL redo at 9AB/32811B38 for Btree/DELETE: 31 items
postgresql-Sun.log-2019-04-07 23:40:31.589 UTC [3320][] : [11-1] LOG:
startup process (PID 3323) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 23:40:31.589 UTC [3320][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 23:48:15.992 UTC [3290][1/0] : [89-1] LOG:
restored log file "00000006000009AA000000D1" from archive
postgresql-Sun.log-2019-04-07 23:48:16.099 UTC [3290][1/0] : [90-1] WARNING:
 page 162136064 of relation base/16385/21351 does not exist
postgresql-Sun.log-2019-04-07 23:48:16.099 UTC [3290][1/0] : [91-1] CONTEXT:
 WAL redo at 9AA/D1D92508 for Btree/DELETE: 8 items
postgresql-Sun.log:2019-04-07 23:48:16.099 UTC [3290][1/0] : [92-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:48:16.099 UTC [3290][1/0] : [93-1] CONTEXT:
 WAL redo at 9AA/D1D92508 for Btree/DELETE: 8 items
postgresql-Sun.log-2019-04-07 23:48:16.176 UTC [3287][] : [11-1] LOG:
startup process (PID 3290) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 23:48:16.176 UTC [3287][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 23:58:31.375 UTC [3307][1/0] : [89-1] LOG:
restored log file "00000006000009AA000000D1" from archive
postgresql-Sun.log-2019-04-07 23:58:31.476 UTC [3307][1/0] : [90-1] WARNING:
 page 162136064 of relation base/16385/21351 does not exist
postgresql-Sun.log-2019-04-07 23:58:31.476 UTC [3307][1/0] : [91-1] CONTEXT:
 WAL redo at 9AA/D1D92508 for Btree/DELETE: 8 items
postgresql-Sun.log:2019-04-07 23:58:31.476 UTC [3307][1/0] : [92-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:58:31.476 UTC [3307][1/0] : [93-1] CONTEXT:
 WAL redo at 9AA/D1D92508 for Btree/DELETE: 8 items
postgresql-Sun.log-2019-04-07 23:58:31.552 UTC [3304][] : [11-1] LOG:
startup process (PID 3307) was terminated by signal 6: Aborted


Re: BUG #15745: WAL References Invalid Pages...that eventually resolves

От
Peter Geoghegan
Дата:
Hi Daniel,

On Tue, Apr 9, 2019 at 1:30 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
> But, for serendipitous reasons, I let this one run for a while. As it turns
> out, with each crash, it would make *slightly* more progress than the time
> before....and then eventually, it suffered no more faults and caught up
> normally. Included is a log that shows how sparse these faults were,
> relative to all the traffic going on....: roughly two per segment on this
> workload, with large gaps between problematic segments, and not necessarily
> repetition in a problematic relation or filenode.

That sounds weird.

> The fact the standby eventually came up made me suspicious, so I ran amcheck
> with a heap re-check, and, no tuples were in violation.
>
> Included is a log, which shows how the system recovered over and over,
> making slight progress each time. This is the entire inventory after such
> crashes: after these, the system passed amcheck and appears to work
> normally.

Did you try bt_index_parent_check('rel', true)? You might want to make
sure that work_mem is set sufficiently high so that the
downlink-block-is-present check is definitely effective; work_mem
bounds the size of a Bloom filter used by the implementation (the heap
verification option has its own Bloom filter, bound by
maintenance_work_mem). Suggest that you "set
client_min_messages=debug1" before running amcheck this way, just in
case that shows something interesting.

> postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [130-1]
> WARNING:  page 162136064 of relation base/16385/21372 does not exist

These WARNING messages all reference block numbers that look like
32-bits of random garbage, but could be from a very large relation.

The relevant WAL record is from B-Tree's opportunistic LP_DEAD garbage
collection (not VACUUM). Note that Andres changed this mechanism for
v12, so that latestRemovedXid was calculated on the primary, rather
than on the standby. I think that this error comes from
btree_xlog_delete_get_latestRemovedXid(), which is in 11 but not
master/12.

I wonder, is "base/16385/21351" the index or the table? Is it possible
to run pg_waldump? I think it's the table.

If the problem is in btree_xlog_delete_get_latestRemovedXid(), then it
is perhaps unsurprising that there isn't evidence of any lasting
corruption.

-- 
Peter Geoghegan



Re: BUG #15745: WAL References Invalid Pages...that eventually resolves

От
Daniel Farina
Дата:


On Sat, Apr 27, 2019 at 8:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
Hi Daniel,

On Tue, Apr 9, 2019 at 1:30 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
> But, for serendipitous reasons, I let this one run for a while. As it turns
> out, with each crash, it would make *slightly* more progress than the time
> before....and then eventually, it suffered no more faults and caught up
> normally. Included is a log that shows how sparse these faults were,
> relative to all the traffic going on....: roughly two per segment on this
> workload, with large gaps between problematic segments, and not necessarily
> repetition in a problematic relation or filenode.

That sounds weird.

Yeah. It was.
 
> The fact the standby eventually came up made me suspicious, so I ran amcheck
> with a heap re-check, and, no tuples were in violation.
>
> Included is a log, which shows how the system recovered over and over,
> making slight progress each time. This is the entire inventory after such
> crashes: after these, the system passed amcheck and appears to work
> normally.

Did you try bt_index_parent_check('rel', true)? You might want to make
sure that work_mem is set sufficiently high so that the
downlink-block-is-present check is definitely effective; work_mem
bounds the size of a Bloom filter used by the implementation (the heap
verification option has its own Bloom filter, bound by
maintenance_work_mem). Suggest that you "set
client_min_messages=debug1" before running amcheck this way, just in
case that shows something interesting.

No, I didn't want to take that lock...but I'll keep in it mind for next time, though I'll have to make arrangements.

> postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [130-1]
> WARNING:  page 162136064 of relation base/16385/21372 does not exist

These WARNING messages all reference block numbers that look like
32-bits of random garbage, but could be from a very large relation.

Definitely not a 1.3TiB relation. Good eye.

The relevant WAL record is from B-Tree's opportunistic LP_DEAD garbage
collection (not VACUUM). Note that Andres changed this mechanism for
v12, so that latestRemovedXid was calculated on the primary, rather
than on the standby. I think that this error comes from
btree_xlog_delete_get_latestRemovedXid(), which is in 11 but not
master/12.

I wonder, is "base/16385/21351" the index or the table? Is it possible
to run pg_waldump? I think it's the table.

If the problem is in btree_xlog_delete_get_latestRemovedXid(), then it
is perhaps unsurprising that there isn't evidence of any lasting
corruption.

 Regrettably, it's too late, but I'll have my notes for next time.

Re: BUG #15745: WAL References Invalid Pages...that eventuallyresolves

От
Michael Paquier
Дата:
On Sun, May 05, 2019 at 08:37:02PM -0700, Daniel Farina wrote:
>  Regrettably, it's too late, but I'll have my notes for next time.

Here is a guess: the base backup which has been used for this
instance's recovery has been taken from a standby, where the
minimum recovery LSN registered in its control file did not actually
map to the last physical LSN where a page has been flushed by either
the checkpointer, the startup process or a read-only backend.

This reminds me of the problems between 3c64dcb and c186ba1 I was at
the origin of, but this refers to 11.1, which was released after the
fact.  Or perhaps a beta version of 11 has been used?
--
Michael

Вложения