Обсуждение: BUG #15745: WAL References Invalid Pages...that eventually resolves
The following bug has been logged on the website: Bug reference: 15745 Logged by: Daniel Farina Email address: daniel@citusdata.com PostgreSQL version: 11.1 Operating system: Amazon Linux 2018.03 Description: We have here an interesting case of a panic during recovery from backup with "WAL References invalid pages." Normally, this is viewed as an issue of corruption. But, for serendipitous reasons, I let this one run for a while. As it turns out, with each crash, it would make *slightly* more progress than the time before....and then eventually, it suffered no more faults and caught up normally. Included is a log that shows how sparse these faults were, relative to all the traffic going on....: roughly two per segment on this workload, with large gaps between problematic segments, and not necessarily repetition in a problematic relation or filenode. The fact the standby eventually came up made me suspicious, so I ran amcheck with a heap re-check, and, no tuples were in violation. Included is a log, which shows how the system recovered over and over, making slight progress each time. This is the entire inventory after such crashes: after these, the system passed amcheck and appears to work normally. postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [130-1] WARNING: page 162136064 of relation base/16385/21372 does not exist postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [131-1] CONTEXT: WAL redo at 9AA/F94F2700 for Btree/DELETE: 2 items postgresql-Mon.log:2019-04-08 00:08:22.619 UTC [3323][1/0] : [132-1] PANIC: WAL contains references to invalid pages postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [133-1] CONTEXT: WAL redo at 9AA/F94F2700 for Btree/DELETE: 2 items -- postgresql-Mon.log-2019-04-08 00:17:31.686 UTC [3277][1/0] : [106-1] LOG: restored log file "00000006000009AA000000E2" from archive postgresql-Mon.log-2019-04-08 00:17:31.741 UTC [3277][1/0] : [107-1] WARNING: page 162136064 of relation base/16385/22606 does not exist postgresql-Mon.log-2019-04-08 00:17:31.741 UTC [3277][1/0] : [108-1] CONTEXT: WAL redo at 9AA/E269B0F0 for Btree/DELETE: 1 items postgresql-Mon.log:2019-04-08 00:17:31.742 UTC [3277][1/0] : [109-1] PANIC: WAL contains references to invalid pages postgresql-Mon.log-2019-04-08 00:17:31.742 UTC [3277][1/0] : [110-1] CONTEXT: WAL redo at 9AA/E269B0F0 for Btree/DELETE: 1 items postgresql-Mon.log-2019-04-08 00:17:31.815 UTC [3274][] : [11-1] LOG: startup process (PID 3277) was terminated by signal 6: Aborted postgresql-Mon.log-2019-04-08 00:17:31.815 UTC [3274][] : [12-1] LOG: terminating any other active server processes -- postgresql-Mon.log-2019-04-08 00:26:17.865 UTC [3288][1/0] : [149-1] WARNING: page 1936013626 of relation base/16385/22536 does not exist postgresql-Mon.log-2019-04-08 00:26:17.865 UTC [3288][1/0] : [150-1] CONTEXT: WAL redo at 9AB/C21CA50 for Btree/DELETE: 35 items postgresql-Mon.log:2019-04-08 00:26:17.865 UTC [3288][1/0] : [151-1] PANIC: WAL contains references to invalid pages postgresql-Mon.log-2019-04-08 00:26:17.865 UTC [3288][1/0] : [152-1] CONTEXT: WAL redo at 9AB/C21CA50 for Btree/DELETE: 35 items postgresql-Mon.log-2019-04-08 00:26:17.943 UTC [3285][] : [11-1] LOG: startup process (PID 3288) was terminated by signal 6: Aborted postgresql-Mon.log-2019-04-08 00:26:17.943 UTC [3285][] : [12-1] LOG: terminating any other active server processes -- postgresql-Mon.log-2019-04-08 00:36:01.466 UTC [3330][1/0] : [118-1] LOG: restored log file "00000006000009AA000000EE" from archive postgresql-Mon.log-2019-04-08 00:36:01.497 UTC [3330][1/0] : [119-1] WARNING: page 1432945756 of relation base/16385/21365 does not exist postgresql-Mon.log-2019-04-08 00:36:01.497 UTC [3330][1/0] : [120-1] CONTEXT: WAL redo at 9AA/EE183EC0 for Btree/DELETE: 43 items postgresql-Mon.log:2019-04-08 00:36:01.497 UTC [3330][1/0] : [121-1] PANIC: WAL contains references to invalid pages postgresql-Mon.log-2019-04-08 00:36:01.497 UTC [3330][1/0] : [122-1] CONTEXT: WAL redo at 9AA/EE183EC0 for Btree/DELETE: 43 items postgresql-Mon.log-2019-04-08 00:36:01.536 UTC [5403][] : [1-1] [app=[unknown]] LOG: connection received: host=172.16.101.249 port=48250 postgresql-Mon.log-2019-04-08 00:36:01.574 UTC [3327][] : [11-1] LOG: startup process (PID 3330) was terminated by signal 6: Aborted -- postgresql-Mon.log-2019-04-08 00:43:37.086 UTC [3338][1/0] : [121-1] LOG: restored log file "00000006000009AA000000F1" from archive postgresql-Mon.log-2019-04-08 00:43:37.134 UTC [3338][1/0] : [122-1] WARNING: page 825766448 of relation base/16385/21358 does not exist postgresql-Mon.log-2019-04-08 00:43:37.134 UTC [3338][1/0] : [123-1] CONTEXT: WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items postgresql-Mon.log:2019-04-08 00:43:37.134 UTC [3338][1/0] : [124-1] PANIC: WAL contains references to invalid pages postgresql-Mon.log-2019-04-08 00:43:37.134 UTC [3338][1/0] : [125-1] CONTEXT: WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items postgresql-Mon.log-2019-04-08 00:43:37.211 UTC [3335][] : [11-1] LOG: startup process (PID 3338) was terminated by signal 6: Aborted postgresql-Mon.log-2019-04-08 00:43:37.211 UTC [3335][] : [12-1] LOG: terminating any other active server processes -- postgresql-Mon.log-2019-04-08 00:52:30.100 UTC [3249][1/0] : [143-1] WARNING: page 162201600 of relation base/16385/22456 does not exist postgresql-Mon.log-2019-04-08 00:52:30.100 UTC [3249][1/0] : [144-1] CONTEXT: WAL redo at 9AB/6F53290 for Btree/DELETE: 3 items postgresql-Mon.log:2019-04-08 00:52:30.100 UTC [3249][1/0] : [145-1] PANIC: WAL contains references to invalid pages postgresql-Mon.log-2019-04-08 00:52:30.100 UTC [3249][1/0] : [146-1] CONTEXT: WAL redo at 9AB/6F53290 for Btree/DELETE: 3 items postgresql-Mon.log-2019-04-08 00:52:30.173 UTC [3246][] : [11-1] LOG: startup process (PID 3249) was terminated by signal 6: Aborted postgresql-Mon.log-2019-04-08 00:52:30.173 UTC [3246][] : [12-1] LOG: terminating any other active server processes -- postgresql-Mon.log-2019-04-08 01:01:56.487 UTC [3315][1/0] : [125-1] LOG: restored log file "00000006000009AA000000F5" from archive postgresql-Mon.log-2019-04-08 01:01:56.545 UTC [3315][1/0] : [126-1] WARNING: page 162136064 of relation base/16385/22456 does not exist postgresql-Mon.log-2019-04-08 01:01:56.545 UTC [3315][1/0] : [127-1] CONTEXT: WAL redo at 9AA/F57AF340 for Btree/DELETE: 1 items postgresql-Mon.log:2019-04-08 01:01:56.545 UTC [3315][1/0] : [128-1] PANIC: WAL contains references to invalid pages postgresql-Mon.log-2019-04-08 01:01:56.545 UTC [3315][1/0] : [129-1] CONTEXT: WAL redo at 9AA/F57AF340 for Btree/DELETE: 1 items -- postgresql-Mon.log-2019-04-08 01:10:55.964 UTC [3274][1/0] : [98-1] LOG: restored log file "00000006000009AA000000DA" from archive postgresql-Mon.log-2019-04-08 01:10:56.026 UTC [3274][1/0] : [99-1] WARNING: page 162201600 of relation base/16385/21379 does not exist postgresql-Mon.log-2019-04-08 01:10:56.026 UTC [3274][1/0] : [100-1] CONTEXT: WAL redo at 9AA/DA735378 for Btree/DELETE: 1 items postgresql-Mon.log:2019-04-08 01:10:56.026 UTC [3274][1/0] : [101-1] PANIC: WAL contains references to invalid pages postgresql-Mon.log-2019-04-08 01:10:56.026 UTC [3274][1/0] : [102-1] CONTEXT: WAL redo at 9AA/DA735378 for Btree/DELETE: 1 items postgresql-Mon.log-2019-04-08 01:10:56.099 UTC [3271][] : [11-1] LOG: startup process (PID 3274) was terminated by signal 6: Aborted postgresql-Mon.log-2019-04-08 01:10:56.099 UTC [3271][] : [12-1] LOG: terminating any other active server processes -- postgresql-Mon.log-2019-04-08 01:21:03.013 UTC [3326][1/0] : [266-1] LOG: restored log file "00000006000009AB00000082" from archive postgresql-Mon.log-2019-04-08 01:21:03.053 UTC [3326][1/0] : [267-1] WARNING: page 1886284143 of relation base/16385/21291 does not exist postgresql-Mon.log-2019-04-08 01:21:03.053 UTC [3326][1/0] : [268-1] CONTEXT: WAL redo at 9AB/823B9360 for Btree/DELETE: 12 items postgresql-Mon.log:2019-04-08 01:21:03.053 UTC [3326][1/0] : [269-1] PANIC: WAL contains references to invalid pages postgresql-Mon.log-2019-04-08 01:21:03.053 UTC [3326][1/0] : [270-1] CONTEXT: WAL redo at 9AB/823B9360 for Btree/DELETE: 12 items postgresql-Mon.log-2019-04-08 01:21:03.134 UTC [3323][] : [11-1] LOG: startup process (PID 3326) was terminated by signal 6: Aborted postgresql-Mon.log-2019-04-08 01:21:03.134 UTC [3323][] : [12-1] LOG: terminating any other active server processes -- postgresql-Mon.log-2019-04-08 01:30:46.438 UTC [3333][1/0] : [95-1] WARNING: page 1934703203 of relation base/16385/21306 does not exist postgresql-Mon.log-2019-04-08 01:30:46.438 UTC [3333][1/0] : [96-1] CONTEXT: WAL redo at 9AA/D6404470 for Btree/DELETE: 9 items postgresql-Mon.log:2019-04-08 01:30:46.438 UTC [3333][1/0] : [97-1] PANIC: WAL contains references to invalid pages postgresql-Mon.log-2019-04-08 01:30:46.438 UTC [3333][1/0] : [98-1] CONTEXT: WAL redo at 9AA/D6404470 for Btree/DELETE: 9 items postgresql-Mon.log-2019-04-08 01:30:46.519 UTC [3330][] : [11-1] LOG: startup process (PID 3333) was terminated by signal 6: Aborted postgresql-Mon.log-2019-04-08 01:30:46.519 UTC [3330][] : [12-1] LOG: terminating any other active server processes -- postgresql-Sun.log-2019-04-07 22:43:15.774 UTC [17013][1/0] : [121-1] LOG: restored log file "00000006000009AA000000F1" from archive postgresql-Sun.log-2019-04-07 22:43:15.824 UTC [17013][1/0] : [122-1] WARNING: page 825766448 of relation base/16385/21358 does not exist postgresql-Sun.log-2019-04-07 22:43:15.824 UTC [17013][1/0] : [123-1] CONTEXT: WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items postgresql-Sun.log:2019-04-07 22:43:15.824 UTC [17013][1/0] : [124-1] PANIC: WAL contains references to invalid pages postgresql-Sun.log-2019-04-07 22:43:15.824 UTC [17013][1/0] : [125-1] CONTEXT: WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items postgresql-Sun.log-2019-04-07 22:43:15.904 UTC [17010][] : [11-1] LOG: startup process (PID 17013) was terminated by signal 6: Aborted postgresql-Sun.log-2019-04-07 22:43:15.904 UTC [17010][] : [12-1] LOG: terminating any other active server processes -- postgresql-Sun.log-2019-04-07 22:44:15.699 UTC [19068][1/0] : [144-1] LOG: restored log file "00000006000009AB00000008" from archive postgresql-Sun.log-2019-04-07 22:44:15.737 UTC [19068][1/0] : [145-1] WARNING: page 162201600 of relation base/16385/21365 does not exist postgresql-Sun.log-2019-04-07 22:44:15.737 UTC [19068][1/0] : [146-1] CONTEXT: WAL redo at 9AB/854FF48 for Btree/DELETE: 2 items postgresql-Sun.log:2019-04-07 22:44:15.737 UTC [19068][1/0] : [147-1] PANIC: WAL contains references to invalid pages postgresql-Sun.log-2019-04-07 22:44:15.737 UTC [19068][1/0] : [148-1] CONTEXT: WAL redo at 9AB/854FF48 for Btree/DELETE: 2 items postgresql-Sun.log-2019-04-07 22:44:15.827 UTC [19065][] : [11-1] LOG: startup process (PID 19068) was terminated by signal 6: Aborted postgresql-Sun.log-2019-04-07 22:44:15.827 UTC [19065][] : [12-1] LOG: terminating any other active server processes -- postgresql-Sun.log-2019-04-07 22:51:58.298 UTC [3306][1/0] : [121-1] LOG: restored log file "00000006000009AA000000F1" from archive postgresql-Sun.log-2019-04-07 22:51:58.342 UTC [3306][1/0] : [122-1] WARNING: page 825766448 of relation base/16385/21358 does not exist postgresql-Sun.log-2019-04-07 22:51:58.342 UTC [3306][1/0] : [123-1] CONTEXT: WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items postgresql-Sun.log:2019-04-07 22:51:58.342 UTC [3306][1/0] : [124-1] PANIC: WAL contains references to invalid pages postgresql-Sun.log-2019-04-07 22:51:58.342 UTC [3306][1/0] : [125-1] CONTEXT: WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items postgresql-Sun.log-2019-04-07 22:51:58.420 UTC [3303][] : [11-1] LOG: startup process (PID 3306) was terminated by signal 6: Aborted postgresql-Sun.log-2019-04-07 22:51:58.421 UTC [3303][] : [12-1] LOG: terminating any other active server processes -- postgresql-Sun.log-2019-04-07 23:02:20.964 UTC [5948][] : [1-1] [app=[unknown]] LOG: connection received: host=3.88.80.230 port=57984 postgresql-Sun.log-2019-04-07 23:02:21.041 UTC [3298][1/0] : [155-1] WARNING: page 825439802 of relation base/16385/21231 does not exist postgresql-Sun.log-2019-04-07 23:02:21.041 UTC [3298][1/0] : [156-1] CONTEXT: WAL redo at 9AB/12A04DA0 for Btree/DELETE: 14 items postgresql-Sun.log:2019-04-07 23:02:21.041 UTC [3298][1/0] : [157-1] PANIC: WAL contains references to invalid pages postgresql-Sun.log-2019-04-07 23:02:21.041 UTC [3298][1/0] : [158-1] CONTEXT: WAL redo at 9AB/12A04DA0 for Btree/DELETE: 14 items postgresql-Sun.log-2019-04-07 23:02:21.117 UTC [3295][] : [11-1] LOG: startup process (PID 3298) was terminated by signal 6: Aborted postgresql-Sun.log-2019-04-07 23:02:21.117 UTC [3295][] : [12-1] LOG: terminating any other active server processes -- postgresql-Sun.log-2019-04-07 23:12:15.077 UTC [3305][1/0] : [112-1] WARNING: page 162136064 of relation base/16385/21351 does not exist postgresql-Sun.log-2019-04-07 23:12:15.077 UTC [3305][1/0] : [113-1] CONTEXT: WAL redo at 9AA/E7AF0838 for Btree/DELETE: 4 items postgresql-Sun.log:2019-04-07 23:12:15.077 UTC [3305][1/0] : [114-1] PANIC: WAL contains references to invalid pages postgresql-Sun.log-2019-04-07 23:12:15.077 UTC [3305][1/0] : [115-1] CONTEXT: WAL redo at 9AA/E7AF0838 for Btree/DELETE: 4 items postgresql-Sun.log-2019-04-07 23:12:15.133 UTC [5203][] : [1-1] [app=[unknown]] LOG: connection received: host=172.16.101.249 port=60556 postgresql-Sun.log-2019-04-07 23:12:15.149 UTC [3302][] : [11-1] LOG: startup process (PID 3305) was terminated by signal 6: Aborted -- postgresql-Sun.log-2019-04-07 23:21:02.554 UTC [5377][] : [1-1] [app=[unknown]] LOG: connection received: host=172.16.101.249 port=49380 postgresql-Sun.log-2019-04-07 23:21:02.588 UTC [3258][1/0] : [130-1] WARNING: page 162136064 of relation base/16385/21372 does not exist postgresql-Sun.log-2019-04-07 23:21:02.588 UTC [3258][1/0] : [131-1] CONTEXT: WAL redo at 9AA/F94F2700 for Btree/DELETE: 2 items postgresql-Sun.log:2019-04-07 23:21:02.588 UTC [3258][1/0] : [132-1] PANIC: WAL contains references to invalid pages postgresql-Sun.log-2019-04-07 23:21:02.588 UTC [3258][1/0] : [133-1] CONTEXT: WAL redo at 9AA/F94F2700 for Btree/DELETE: 2 items postgresql-Sun.log-2019-04-07 23:21:02.663 UTC [3255][] : [11-1] LOG: startup process (PID 3258) was terminated by signal 6: Aborted postgresql-Sun.log-2019-04-07 23:21:02.663 UTC [3255][] : [12-1] LOG: terminating any other active server processes -- postgresql-Sun.log-2019-04-07 23:30:34.035 UTC [3272][1/0] : [135-1] WARNING: page 162136064 of relation base/16385/21365 does not exist postgresql-Sun.log-2019-04-07 23:30:34.035 UTC [3272][1/0] : [136-1] CONTEXT: WAL redo at 9AA/FE2A8BE8 for Btree/DELETE: 2 items postgresql-Sun.log:2019-04-07 23:30:34.035 UTC [3272][1/0] : [137-1] PANIC: WAL contains references to invalid pages postgresql-Sun.log-2019-04-07 23:30:34.035 UTC [3272][1/0] : [138-1] CONTEXT: WAL redo at 9AA/FE2A8BE8 for Btree/DELETE: 2 items postgresql-Sun.log-2019-04-07 23:30:34.111 UTC [3269][] : [11-1] LOG: startup process (PID 3272) was terminated by signal 6: Aborted postgresql-Sun.log-2019-04-07 23:30:34.111 UTC [3269][] : [12-1] LOG: terminating any other active server processes -- postgresql-Sun.log-2019-04-07 23:40:31.513 UTC [3323][1/0] : [187-1] WARNING: page 2710832221 of relation base/16385/21393 does not exist postgresql-Sun.log-2019-04-07 23:40:31.513 UTC [3323][1/0] : [188-1] CONTEXT: WAL redo at 9AB/32811B38 for Btree/DELETE: 31 items postgresql-Sun.log:2019-04-07 23:40:31.513 UTC [3323][1/0] : [189-1] PANIC: WAL contains references to invalid pages postgresql-Sun.log-2019-04-07 23:40:31.513 UTC [3323][1/0] : [190-1] CONTEXT: WAL redo at 9AB/32811B38 for Btree/DELETE: 31 items postgresql-Sun.log-2019-04-07 23:40:31.589 UTC [3320][] : [11-1] LOG: startup process (PID 3323) was terminated by signal 6: Aborted postgresql-Sun.log-2019-04-07 23:40:31.589 UTC [3320][] : [12-1] LOG: terminating any other active server processes -- postgresql-Sun.log-2019-04-07 23:48:15.992 UTC [3290][1/0] : [89-1] LOG: restored log file "00000006000009AA000000D1" from archive postgresql-Sun.log-2019-04-07 23:48:16.099 UTC [3290][1/0] : [90-1] WARNING: page 162136064 of relation base/16385/21351 does not exist postgresql-Sun.log-2019-04-07 23:48:16.099 UTC [3290][1/0] : [91-1] CONTEXT: WAL redo at 9AA/D1D92508 for Btree/DELETE: 8 items postgresql-Sun.log:2019-04-07 23:48:16.099 UTC [3290][1/0] : [92-1] PANIC: WAL contains references to invalid pages postgresql-Sun.log-2019-04-07 23:48:16.099 UTC [3290][1/0] : [93-1] CONTEXT: WAL redo at 9AA/D1D92508 for Btree/DELETE: 8 items postgresql-Sun.log-2019-04-07 23:48:16.176 UTC [3287][] : [11-1] LOG: startup process (PID 3290) was terminated by signal 6: Aborted postgresql-Sun.log-2019-04-07 23:48:16.176 UTC [3287][] : [12-1] LOG: terminating any other active server processes -- postgresql-Sun.log-2019-04-07 23:58:31.375 UTC [3307][1/0] : [89-1] LOG: restored log file "00000006000009AA000000D1" from archive postgresql-Sun.log-2019-04-07 23:58:31.476 UTC [3307][1/0] : [90-1] WARNING: page 162136064 of relation base/16385/21351 does not exist postgresql-Sun.log-2019-04-07 23:58:31.476 UTC [3307][1/0] : [91-1] CONTEXT: WAL redo at 9AA/D1D92508 for Btree/DELETE: 8 items postgresql-Sun.log:2019-04-07 23:58:31.476 UTC [3307][1/0] : [92-1] PANIC: WAL contains references to invalid pages postgresql-Sun.log-2019-04-07 23:58:31.476 UTC [3307][1/0] : [93-1] CONTEXT: WAL redo at 9AA/D1D92508 for Btree/DELETE: 8 items postgresql-Sun.log-2019-04-07 23:58:31.552 UTC [3304][] : [11-1] LOG: startup process (PID 3307) was terminated by signal 6: Aborted
Hi Daniel, On Tue, Apr 9, 2019 at 1:30 PM PG Bug reporting form <noreply@postgresql.org> wrote: > But, for serendipitous reasons, I let this one run for a while. As it turns > out, with each crash, it would make *slightly* more progress than the time > before....and then eventually, it suffered no more faults and caught up > normally. Included is a log that shows how sparse these faults were, > relative to all the traffic going on....: roughly two per segment on this > workload, with large gaps between problematic segments, and not necessarily > repetition in a problematic relation or filenode. That sounds weird. > The fact the standby eventually came up made me suspicious, so I ran amcheck > with a heap re-check, and, no tuples were in violation. > > Included is a log, which shows how the system recovered over and over, > making slight progress each time. This is the entire inventory after such > crashes: after these, the system passed amcheck and appears to work > normally. Did you try bt_index_parent_check('rel', true)? You might want to make sure that work_mem is set sufficiently high so that the downlink-block-is-present check is definitely effective; work_mem bounds the size of a Bloom filter used by the implementation (the heap verification option has its own Bloom filter, bound by maintenance_work_mem). Suggest that you "set client_min_messages=debug1" before running amcheck this way, just in case that shows something interesting. > postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [130-1] > WARNING: page 162136064 of relation base/16385/21372 does not exist These WARNING messages all reference block numbers that look like 32-bits of random garbage, but could be from a very large relation. The relevant WAL record is from B-Tree's opportunistic LP_DEAD garbage collection (not VACUUM). Note that Andres changed this mechanism for v12, so that latestRemovedXid was calculated on the primary, rather than on the standby. I think that this error comes from btree_xlog_delete_get_latestRemovedXid(), which is in 11 but not master/12. I wonder, is "base/16385/21351" the index or the table? Is it possible to run pg_waldump? I think it's the table. If the problem is in btree_xlog_delete_get_latestRemovedXid(), then it is perhaps unsurprising that there isn't evidence of any lasting corruption. -- Peter Geoghegan
On Sat, Apr 27, 2019 at 8:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
Hi Daniel,
On Tue, Apr 9, 2019 at 1:30 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
> But, for serendipitous reasons, I let this one run for a while. As it turns
> out, with each crash, it would make *slightly* more progress than the time
> before....and then eventually, it suffered no more faults and caught up
> normally. Included is a log that shows how sparse these faults were,
> relative to all the traffic going on....: roughly two per segment on this
> workload, with large gaps between problematic segments, and not necessarily
> repetition in a problematic relation or filenode.
That sounds weird.
Yeah. It was.
> The fact the standby eventually came up made me suspicious, so I ran amcheck
> with a heap re-check, and, no tuples were in violation.
>
> Included is a log, which shows how the system recovered over and over,
> making slight progress each time. This is the entire inventory after such
> crashes: after these, the system passed amcheck and appears to work
> normally.
Did you try bt_index_parent_check('rel', true)? You might want to make
sure that work_mem is set sufficiently high so that the
downlink-block-is-present check is definitely effective; work_mem
bounds the size of a Bloom filter used by the implementation (the heap
verification option has its own Bloom filter, bound by
maintenance_work_mem). Suggest that you "set
client_min_messages=debug1" before running amcheck this way, just in
case that shows something interesting.
No, I didn't want to take that lock...but I'll keep in it mind for next time, though I'll have to make arrangements.
> postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [130-1]
> WARNING: page 162136064 of relation base/16385/21372 does not exist
These WARNING messages all reference block numbers that look like
32-bits of random garbage, but could be from a very large relation.
Definitely not a 1.3TiB relation. Good eye.
The relevant WAL record is from B-Tree's opportunistic LP_DEAD garbage
collection (not VACUUM). Note that Andres changed this mechanism for
v12, so that latestRemovedXid was calculated on the primary, rather
than on the standby. I think that this error comes from
btree_xlog_delete_get_latestRemovedXid(), which is in 11 but not
master/12.
I wonder, is "base/16385/21351" the index or the table? Is it possible
to run pg_waldump? I think it's the table.
If the problem is in btree_xlog_delete_get_latestRemovedXid(), then it
is perhaps unsurprising that there isn't evidence of any lasting
corruption.
Regrettably, it's too late, but I'll have my notes for next time.
On Sun, May 05, 2019 at 08:37:02PM -0700, Daniel Farina wrote: > Regrettably, it's too late, but I'll have my notes for next time. Here is a guess: the base backup which has been used for this instance's recovery has been taken from a standby, where the minimum recovery LSN registered in its control file did not actually map to the last physical LSN where a page has been flushed by either the checkpointer, the startup process or a read-only backend. This reminds me of the problems between 3c64dcb and c186ba1 I was at the origin of, but this refers to 11.1, which was released after the fact. Or perhaps a beta version of 11 has been used? -- Michael