BUG #13797: file "pg_multixact/members/79D4" doesn't exist, reading as zeroes
From: plockaby@uw.edu
Subject: BUG #13797: file "pg_multixact/members/79D4" doesn't exist, reading as zeroes
Msg-id: 20151204225806.2618.8469@wrigleys.postgresql.org
List: pgsql-bugs
The following bug has been logged on the website:

Bug reference:      13797
Logged by:          Paul Lockaby
Email address:      plockaby@uw.edu
PostgreSQL version: 9.4.5
Operating system:   CentOS 6

Description:

We have three Postgres servers configured with cascading replication: 1 -> 2 -> 3. The primary stayed up, but both replicas failed simultaneously. I see this in the error logs:

Dec 4 04:05:12 mario03 postgres[26467]: [38-1] LOG: file "pg_multixact/members/79D4" doesn't exist, reading as zeroes
Dec 4 04:05:12 mario03 postgres[26467]: [38-2] CONTEXT: xlog redo create mxid 4185669 offset 1632733906 nmembers 26561: 41794592 (keysh) 41794593 (fornokeyupd) 41794594 (fornokeyupd) 41794595 (fornokeyupd) 41794596 (fornokeyupd) 41794597 (fornokeyupd) 41794598 (fornokeyupd) 41794599 (fornokeyupd) 41794600 (fornokeyupd) 41794601 (fornokeyupd) 41794602 (fornokeyupd) 41794603 (fornokeyupd) 41794604 (fornokeyupd) 41794605 (fornokeyupd) 41794606 (fornokeyupd) 41794607 (fornokeyupd) 41794608 (fornokeyupd) 41794609 (fornokeyupd) 41794610 (fornokeyupd) 41794611 (fornokeyupd) 41794612 (fornokeyupd) 41794613 (fornokeyupd) 41794614 (fornokeyupd) 41794615 (fornokeyupd) 41794616 (fornokeyupd) 41794617 (fornokeyupd) 41794618 (fornokeyupd) 41794619 (fornokeyupd) 41794620 (fornokeyupd) 41794621 (fornokeyupd) 41794622 (fornokeyupd) 41794623 (fornokeyupd) 41794624 (fornokeyupd) 41794625 (fornokeyupd) 41794626 (fornokeyupd) 41794627 (fornokeyupd)
Dec 4 04:05:12 mario03 postgres[26467]: [38-3] 41794628 (fornokeyupd) 41794629 (fornokeyupd) 41794630 (fornokeyupd) 41794631 (fornokeyupd) 41794632 (fornokeyupd) 41794633 (fornokeyupd) 41794634 (fornokeyupd) 41794635 (fornokeyupd) 41794636 (fornokeyupd) 41794637 (fornokeyupd) 41794638 (fornokeyupd) 41794639 (fornokeyupd) 41794640 (fornokeyupd) 41794641 (fornokeyupd) 41794642 (fornokeyupd) 41794643 (fornokeyupd) 41794644 (fornokeyupd) 41794645 (fornokeyupd) 41794646 (fornokeyupd) 41794647 (fornokeyupd) 41794648 (fornokeyupd) 41794649 (fornokeyupd) 41794650 (fornokeyupd) 41794651 (fornokeyupd) 41794652 (fornokeyupd) 41794653 (fornokeyupd) 41794654 (fornokeyupd) 41794655 (fornokeyupd) 41794656 (fornokeyupd) 41794657 (fornokeyupd) 41794658 (fornokeyupd) 41794659 (fornokeyupd) 41794660 (fornokeyupd) 41794661 (fornokeyupd) 41794662 (fornokeyupd) 41794663 (fornokeyupd) 41794664 (fornokeyupd) 41794665 (fornokeyupd) 41794666 (fornokeyupd) ....

The fornokeyupd message repeated dozens of times over about two seconds, after which the server appeared to recover. The same error, naming a different file in pg_multixact, appeared 52 minutes later and again recovered, and then once more about 30 minutes after that.
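For what it's worth, the numbers in that CONTEXT line are consistent with the file named in the LOG line. Below is a minimal sketch of the offset-to-segment arithmetic, using the member-group layout constants as I read them out of src/backend/access/transam/multixact.c in 9.4 (the constants and helper are my own reconstruction, not anything the server printed): the record's 26,561 members start in segment 79D3 and spill over into segment 79D4, which is presumably why redo went looking for that file.

    # Sketch, assuming the 9.4 multixact member layout: each 20-byte group
    # holds 4 members (4 x 4-byte xid + 4 flag bytes), pages are 8192 bytes,
    # and an SLRU segment is 32 pages.
    BLCKSZ = 8192
    MEMBERS_PER_GROUP = 4
    GROUP_SIZE = 4 * MEMBERS_PER_GROUP + 4                         # 20 bytes
    MEMBERS_PER_PAGE = (BLCKSZ // GROUP_SIZE) * MEMBERS_PER_GROUP  # 1636
    PAGES_PER_SEGMENT = 32

    def member_segment(offset):
        # Mirrors what I understand MXOffsetToMemberSegment() to do:
        # offset -> page -> segment, printed as the 4-hex-digit file name.
        return "%04X" % (offset // MEMBERS_PER_PAGE // PAGES_PER_SEGMENT)

    first, nmembers = 1632733906, 26561              # from the redo record
    print(member_segment(first))                     # 79D3
    print(member_segment(first + nmembers - 1))      # 79D4, the "missing" file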
Both replicas did this throughout the day until one of them ended this way:

Dec 4 13:05:30 mario03 postgres[3617]: [6-1] FATAL: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
Dec 4 13:05:30 mario03 postgres[3617]: [6-2] #011
Dec 4 13:05:30 mario03 postgres[26466]: [18-1] LOG: startup process (PID 26467) exited with exit code 1
Dec 4 13:05:30 mario03 postgres[26466]: [19-1] LOG: terminating any other active server processes
Dec 4 13:05:30 mario03 postgres[21981]: [10-1] WARNING: terminating connection because of crash of another server process
Dec 4 13:05:30 mario03 postgres[21981]: [10-2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Dec 4 13:05:30 mario03 postgres[21981]: [10-3] HINT: In a moment you should be able to reconnect to the database and repeat your command.

I attempted to restart the failed replica. It appeared to be processing WAL files (the WAL receiver counter kept incrementing), but it never left the "database system is starting up" state:

Dec 4 14:39:25 mario03 postgres[24794]: [1-1] LOG: ending log output to stderr
Dec 4 14:39:25 mario03 postgres[24794]: [1-2] HINT: Future log output will go to log destination "syslog".
Dec 4 14:39:25 mario03 postgres[24795]: [2-1] LOG: database system was interrupted while in recovery at log time 2015-12-04 13:02:36 PST
Dec 4 14:39:25 mario03 postgres[24795]: [2-2] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
Dec 4 14:39:26 mario03 postgres[24795]: [3-1] LOG: entering standby mode
Dec 4 14:39:26 mario03 postgres[24795]: [4-1] LOG: redo starts at 219/A81D9EF8
Dec 4 14:39:27 mario03 postgres[24795]: [5-1] LOG: consistent recovery state reached at 219/A84A6FC0
Dec 4 14:39:27 mario03 postgres[24795]: [6-1] LOG: invalid record length at 219/A84E0FC8
Dec 4 14:39:27 mario03 postgres[24798]: [2-1] FATAL: could not connect to the primary server: FATAL: the database system is starting up
Dec 4 14:39:27 mario03 postgres[24798]: [2-2] #011
Dec 4 14:39:35 mario03 postgres[24842]: [2-1] LOG: connection received: host=[local]
Dec 4 14:39:35 mario03 postgres[24842]: [3-1] FATAL: the database system is starting up
Dec 4 14:39:42 mario03 postgres[24888]: [2-1] FATAL: could not connect to the primary server: FATAL: the database system is starting up
Dec 4 14:39:42 mario03 postgres[24888]: [2-2] #011

The pg_multixact/members directory contains 34,533 files, dating all the way back to when we first turned the cluster on, October 26th. And the file 79D4 does in fact exist:

-rw------- 1 postgres users 262144 Dec 4 04:05 79D4

Ultimately we had to rebuild the replicas with pg_basebackup. This exact same configuration ran on Postgres 9.1 for literally years without ever restarting or crashing, so I'm a bit flummoxed about this.
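In case anyone wants to reproduce the on-disk check, here is a hypothetical companion script (the data-directory path and the layout constant are my assumptions, reusing the figures from the sketch above): it lists the segment files under pg_multixact/members and prints the member-offset range each one should cover, which is how one can confirm that a segment such as 79D4 is actually present.

    # Hypothetical diagnostic: enumerate members/ segments and the member
    # offsets each should hold, assuming 1636 members/page and 32 pages/segment.
    import os

    MEMBERS_PER_SEGMENT = 1636 * 32                    # 52,352 members

    members_dir = "/var/lib/pgsql/9.4/data/pg_multixact/members"  # assumed path

    for name in sorted(os.listdir(members_dir), key=lambda n: int(n, 16)):
        lo = int(name, 16) * MEMBERS_PER_SEGMENT
        print("%s: offsets %d..%d" % (name, lo, lo + MEMBERS_PER_SEGMENT - 1))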