Обсуждение: PANIC while doing failover (streaming replication)

Поиск

Список

Период

Сортировка

PANIC while doing failover (streaming replication)

От

Mikko Partio

Дата:

01 июля 2011 г., 05:18:47

Hello list

I have two a machine cluster with PostgreSQL 9.0.4 and streaming replication. In a normal situation I did a failover -- touched the trigger file in standby to promote it to production mode. I have done this previously without any complications but now the to-be-production-database had a PANIC and shut itself down. From the logs:

postgres[10751]: [2-1] 2011-06-30 17:25:24 EEST [10751]: [1-1] user=,db= LOG: streaming replication successfully connected to primary

postgres[10736]: [10-1] 2011-07-01 07:50:29 EEST [10736]: [10-1] user=,db= LOG: trigger file found: /postgresql/data/finish_replication.trigger

postgres[10751]: [3-1] 2011-07-01 07:50:29 EEST [10751]: [2-1] user=,db= FATAL: terminating walreceiver process due to administrator command

postgres[10736]: [11-1] 2011-07-01 07:50:30 EEST [10736]: [11-1] user=,db= LOG: redo done at AE/8B1855C8

postgres[10736]: [12-1] 2011-07-01 07:50:30 EEST [10736]: [12-1] user=,db= LOG: last completed transaction was at log time 2011-07-01 07:50:14.67424+03

postgres[10736]: [13-1] 2011-07-01 07:50:30 EEST [10736]: [13-1] user=,db= LOG: restored log file "00000006000000AE0000008B" from archive

postgres[10736]: [14-1] 2011-07-01 07:50:30 EEST [10736]: [14-1] user=,db= LOG: selected new timeline ID: 7

postgres[10736]: [15-1] 2011-07-01 07:50:30 EEST [10736]: [15-1] user=,db= LOG: restored log file "00000006.history" from archive

postgres[10736]: [16-1] 2011-07-01 07:50:30 EEST [10736]: [16-1] user=,db= LOG: archive recovery complete

postgres[10736]: [17-1] 2011-07-01 07:50:30 EEST [10736]: [17-1] user=,db= WARNING: page 68975 of relation base/3711580/4398155 was uninitialized

postgres[10736]: [18-1] 2011-07-01 07:50:30 EEST [10736]: [18-1] user=,db= WARNING: page 782 of relation base/3711580/4395394 was uninitialized

postgres[10736]: [19-1] 2011-07-01 07:50:30 EEST [10736]: [19-1] user=,db= WARNING: page 68948 of relation base/3711580/4398155 was uninitialized

postgres[10736]: [20-1] 2011-07-01 07:50:30 EEST [10736]: [20-1] user=,db= WARNING: page 68986 of relation base/3711580/4398155 was uninitialized

[ there are maybe 50 lines of these WARNING messages for all together six different relations, all indexes ]

postgres[10736]: [73-1] 2011-07-01 07:50:31 EEST [10736]: [73-1] user=,db= PANIC: WAL contains references to invalid pages

postgres[10727]: [2-1] 2011-07-01 07:50:31 EEST [10727]: [2-1] user=,db= LOG: startup process (PID 10736) was terminated by signal 6: Aborted

postgres[10727]: [3-1] 2011-07-01 07:50:31 EEST [10727]: [3-1] user=,db= LOG: terminating any other active server processes

Then I started postgres as it was shut down:

postgres[12191]: [1-1] 2011-07-01 07:50:59 EEST [12191]: [1-1] user=,db= LOG: database system was interrupted while in recovery at log time 2011-07-01 07:44:02 EEST

postgres[12191]: [1-2] 2011-07-01 07:50:59 EEST [12191]: [2-1] user=,db= HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.

postgres[12191]: [2-1] 2011-07-01 07:50:59 EEST [12191]: [3-1] user=,db= LOG: database system was not properly shut down; automatic recovery in progress

postgres[12191]: [3-1] 2011-07-01 07:50:59 EEST [12191]: [4-1] user=,db= LOG: redo starts at AE/8B13E398

postgres[12191]: [4-1] 2011-07-01 07:50:59 EEST [12191]: [5-1] user=,db= LOG: consistent recovery state reached at AE/8C000000

postgres[12191]: [5-1] 2011-07-01 07:50:59 EEST [12191]: [6-1] user=,db= LOG: unexpected pageaddr AE/6E000000 in log file 174, segment 140, offset 0

postgres[12191]: [6-1] 2011-07-01 07:50:59 EEST [12191]: [7-1] user=,db= LOG: redo done at AE/8B1855C8

postgres[12191]: [7-1] 2011-07-01 07:50:59 EEST [12191]: [8-1] user=,db= LOG: last completed transaction was at log time 2011-07-01 07:50:14.67424+03

postgres[12194]: [1-1] 2011-07-01 07:50:59 EEST [12194]: [1-1] user=,db= LOG: autovacuum launcher started

postgres[12189]: [1-1] 2011-07-01 07:50:59 EEST [12189]: [1-1] user=,db= LOG: database system is ready to accept connections

Are these log messages something to worry about?

Regards

Mikko

Re: PANIC while doing failover (streaming replication)

От

Fujii Masao

Дата:

01 июля 2011 г., 05:54:56

On Fri, Jul 1, 2011 at 2:18 PM, Mikko Partio <mpartio@gmail.com> wrote:
> Hello list
> I have two a machine cluster with PostgreSQL 9.0.4 and streaming
> replication. In a normal situation I did a failover -- touched the trigger
> file in standby to promote it to production mode. I have done this
> previously without any complications but now the to-be-production-database
> had a PANIC and shut itself down. From the logs:
> postgres[10751]: [2-1] 2011-06-30 17:25:24 EEST [10751]: [1-1] user=,db=
> LOG:  streaming replication successfully connected to primary
> postgres[10736]: [10-1] 2011-07-01 07:50:29 EEST [10736]: [10-1] user=,db=
> LOG:  trigger file found: /postgresql/data/finish_replication.trigger
> postgres[10751]: [3-1] 2011-07-01 07:50:29 EEST [10751]: [2-1] user=,db=
> FATAL:  terminating walreceiver process due to administrator command
> postgres[10736]: [11-1] 2011-07-01 07:50:30 EEST [10736]: [11-1] user=,db=
> LOG:  redo done at AE/8B1855C8
> postgres[10736]: [12-1] 2011-07-01 07:50:30 EEST [10736]: [12-1] user=,db=
> LOG:  last completed transaction was at log time 2011-07-01
> 07:50:14.67424+03
> postgres[10736]: [13-1] 2011-07-01 07:50:30 EEST [10736]: [13-1] user=,db=
> LOG:  restored log file "00000006000000AE0000008B" from archive
> postgres[10736]: [14-1] 2011-07-01 07:50:30 EEST [10736]: [14-1] user=,db=
> LOG:  selected new timeline ID: 7
> postgres[10736]: [15-1] 2011-07-01 07:50:30 EEST [10736]: [15-1] user=,db=
> LOG:  restored log file "00000006.history" from archive
> postgres[10736]: [16-1] 2011-07-01 07:50:30 EEST [10736]: [16-1] user=,db=
> LOG:  archive recovery complete
> postgres[10736]: [17-1] 2011-07-01 07:50:30 EEST [10736]: [17-1] user=,db=
> WARNING:  page 68975 of relation base/3711580/4398155 was uninitialized
> postgres[10736]: [18-1] 2011-07-01 07:50:30 EEST [10736]: [18-1] user=,db=
> WARNING:  page 782 of relation base/3711580/4395394 was uninitialized
> postgres[10736]: [19-1] 2011-07-01 07:50:30 EEST [10736]: [19-1] user=,db=
> WARNING:  page 68948 of relation base/3711580/4398155 was uninitialized
> postgres[10736]: [20-1] 2011-07-01 07:50:30 EEST [10736]: [20-1] user=,db=
> WARNING:  page 68986 of relation base/3711580/4398155 was uninitialized
> [ there are maybe 50 lines of these WARNING messages for all together six
> different relations, all indexes ]
> postgres[10736]: [73-1] 2011-07-01 07:50:31 EEST [10736]: [73-1] user=,db=
> PANIC:  WAL contains references to invalid pages
> postgres[10727]: [2-1] 2011-07-01 07:50:31 EEST [10727]: [2-1] user=,db=
> LOG:  startup process (PID 10736) was terminated by signal 6: Aborted
> postgres[10727]: [3-1] 2011-07-01 07:50:31 EEST [10727]: [3-1] user=,db=
> LOG:  terminating any other active server processes
> Then I started postgres as it was shut down:
> postgres[12191]: [1-1] 2011-07-01 07:50:59 EEST [12191]: [1-1] user=,db=
> LOG:  database system was interrupted while in recovery at log time
> 2011-07-01 07:44:02 EEST
> postgres[12191]: [1-2] 2011-07-01 07:50:59 EEST [12191]: [2-1] user=,db=
> HINT:  If this has occurred more than once some data might be corrupted and
> you might need to choose an earlier recovery target.
> postgres[12191]: [2-1] 2011-07-01 07:50:59 EEST [12191]: [3-1] user=,db=
> LOG:  database system was not properly shut down; automatic recovery in
> progress
> postgres[12191]: [3-1] 2011-07-01 07:50:59 EEST [12191]: [4-1] user=,db=
> LOG:  redo starts at AE/8B13E398
> postgres[12191]: [4-1] 2011-07-01 07:50:59 EEST [12191]: [5-1] user=,db=
> LOG:  consistent recovery state reached at AE/8C000000
> postgres[12191]: [5-1] 2011-07-01 07:50:59 EEST [12191]: [6-1] user=,db=
> LOG:  unexpected pageaddr AE/6E000000 in log file 174, segment 140, offset 0
> postgres[12191]: [6-1] 2011-07-01 07:50:59 EEST [12191]: [7-1] user=,db=
> LOG:  redo done at AE/8B1855C8
> postgres[12191]: [7-1] 2011-07-01 07:50:59 EEST [12191]: [8-1] user=,db=
> LOG:  last completed transaction was at log time 2011-07-01
> 07:50:14.67424+03
> postgres[12194]: [1-1] 2011-07-01 07:50:59 EEST [12194]: [1-1] user=,db=
> LOG:  autovacuum launcher started
> postgres[12189]: [1-1] 2011-07-01 07:50:59 EEST [12189]: [1-1] user=,db=
> LOG:  database system is ready to accept connections
> Are these log messages something to worry about?

http://archives.postgresql.org/pgsql-hackers/2008-12/msg01335.php
According to the above thread, this issue seems to exist for a few years,
and be unsolved yet. Could you provide a self-contained test case?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: PANIC while doing failover (streaming replication)

От

Simon Riggs

Дата:

01 июля 2011 г., 07:16:26

On Fri, Jul 1, 2011 at 6:54 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

> http://archives.postgresql.org/pgsql-hackers/2008-12/msg01335.php
> According to the above thread, this issue seems to exist for a few years,
> and be unsolved yet. Could you provide a self-contained test case?

I'll add some more logging to help diagnose this. Those log reports
are mostly useless in the form they are in now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PANIC while doing failover (streaming replication)

От

Mikko Partio

Дата:

01 июля 2011 г., 11:00:44

http://archives.postgresql.org/pgsql-hackers/2008-12/msg01335.php
According to the above thread, this issue seems to exist for a few years,
and be unsolved yet. Could you provide a self-contained test case?

I can try ...

In related to this, after the failover we got several other worrying messages in the log.

During vacuums lots of these:

WARNING: relation "pg_largeobject" page 5333088 is uninitialized --- fixing

For several different tables. I gather that since these are mere WARNINGs they do not affect the server functionality or data integrity? Then, for one table we got:

ERROR: unexpected chunk number 2 (expected 0) for toast value 4398162 in pg_toast_1247555

which is quite worrying since we had to restore that table from backups. Are these errors caused by the PANIC'd shutdown or are they a symptom of something else. The servers are quite new and no server crashes have happened.

Regards

Mikko

Re: [HACKERS] PANIC while doing failover (streaming replication)

От

Matheus Ricardo Espanhol

Дата:

20 июня 2012 г., 14:53:19

Hi,

Any news on this issue? My slave server is not trusted, all the warnings below are related to
indexes of the main tables:

2012-06-14 11:46:23 BRT [18654]: [34603-1] user=,db= LOG: recovery restart point at 435/4E899CE8
2012-06-14 11:46:23 BRT [18654]: [34604-1] user=,db= DETAIL: last completed transaction was at log time 2012-06-14 11:46:22.770348-03
2012-06-14 11:48:53 BRT [18654]: [34605-1] user=,db= LOG: restartpoint starting: time
2012-06-14 11:51:09 BRT [18653]: [18-1] user=,db= LOG: received promote request
2012-06-14 11:51:09 BRT [11879]: [2-1] user=,db= FATAL: terminating walreceiver process due to administrator command
2012-06-14 11:51:09 BRT [18653]: [19-1] user=,db= LOG: invalid record length at 435/5079CAE8
2012-06-14 11:51:09 BRT [18653]: [20-1] user=,db= LOG: redo done at 435/5079CAA8
2012-06-14 11:51:09 BRT [18653]: [21-1] user=,db= LOG: last completed transaction was at log time 2012-06-14 11:51:08.969771-03
2012-06-14 11:51:09 BRT [18653]: [22-1] user=,db= LOG: selected new timeline ID: 2
2012-06-14 11:51:09 BRT [18653]: [23-1] user=,db= LOG: archive recovery complete
2012-06-14 11:51:09 BRT [18653]: [24-1] user=,db= WARNING: page 193569 of relation base/16407/80416524 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [25-1] user=,db= WARNING: page 129229 of relation base/16407/38334886 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [26-1] user=,db= WARNING: page 134146 of relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [27-1] user=,db= WARNING: page 318013 of relation base/16407/38334887 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [28-1] user=,db= WARNING: page 134143 of relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [29-1] user=,db= WARNING: page 156203 of relation base/16407/38334883 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [30-1] user=,db= WARNING: page 318009 of relation base/16407/38334887 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [31-1] user=,db= WARNING: page 370150 of relation base/16407/38334888 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [32-1] user=,db= WARNING: page 133811 of relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [33-1] user=,db= WARNING: page 317963 of relation base/16407/38334884 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [34-1] user=,db= WARNING: page 133808 of relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [35-1] user=,db= WARNING: page 133809 of relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [36-1] user=,db= WARNING: page 133810 of relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [37-1] user=,db= WARNING: page 129231 of relation base/16407/38334886 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [38-1] user=,db= WARNING: page 14329 of relation base/16407/80430266 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [39-1] user=,db= WARNING: page 134145 of relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [40-1] user=,db= WARNING: page 134144 of relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [41-1] user=,db= WARNING: page 317957 of relation base/16407/38334884 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [42-1] user=,db= WARNING: page 129230 of relation base/16407/38334886 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [43-1] user=,db= WARNING: page 156201 of relation base/16407/38334883 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [44-1] user=,db= PANIC: WAL contains references to invalid pages
2012-06-14 11:51:09 BRT [18651]: [4-1] user=,db= LOG: startup process (PID 18653) was terminated by signal 6: Aborted
2012-06-14 11:51:09 BRT [18651]: [5-1] user=,db= LOG: terminating any other active server processes

PostgreSQL version 9.1.3.

Thanks in advance for any suggestions.

-Matheus

2011/7/1 Fujii Masao <masao.fujii@gmail.com>

On Fri, Jul 1, 2011 at 2:18 PM, Mikko Partio <mpartio@gmail.com> wrote:
> Hello list
> I have two a machine cluster with PostgreSQL 9.0.4 and streaming
> replication. In a normal situation I did a failover -- touched the trigger
> file in standby to promote it to production mode. I have done this
> previously without any complications but now the to-be-production-database
> had a PANIC and shut itself down. From the logs:
> postgres[10751]: [2-1] 2011-06-30 17:25:24 EEST [10751]: [1-1] user=,db=
> LOG: streaming replication successfully connected to primary
> postgres[10736]: [10-1] 2011-07-01 07:50:29 EEST [10736]: [10-1] user=,db=
> LOG: trigger file found: /postgresql/data/finish_replication.trigger
> postgres[10751]: [3-1] 2011-07-01 07:50:29 EEST [10751]: [2-1] user=,db=
> FATAL: terminating walreceiver process due to administrator command
> postgres[10736]: [11-1] 2011-07-01 07:50:30 EEST [10736]: [11-1] user=,db=
> LOG: redo done at AE/8B1855C8
> postgres[10736]: [12-1] 2011-07-01 07:50:30 EEST [10736]: [12-1] user=,db=
> LOG: last completed transaction was at log time 2011-07-01
> 07:50:14.67424+03
> postgres[10736]: [13-1] 2011-07-01 07:50:30 EEST [10736]: [13-1] user=,db=
> LOG: restored log file "00000006000000AE0000008B" from archive
> postgres[10736]: [14-1] 2011-07-01 07:50:30 EEST [10736]: [14-1] user=,db=
> LOG: selected new timeline ID: 7
> postgres[10736]: [15-1] 2011-07-01 07:50:30 EEST [10736]: [15-1] user=,db=
> LOG: restored log file "00000006.history" from archive
> postgres[10736]: [16-1] 2011-07-01 07:50:30 EEST [10736]: [16-1] user=,db=
> LOG: archive recovery complete
> postgres[10736]: [17-1] 2011-07-01 07:50:30 EEST [10736]: [17-1] user=,db=
> WARNING: page 68975 of relation base/3711580/4398155 was uninitialized
> postgres[10736]: [18-1] 2011-07-01 07:50:30 EEST [10736]: [18-1] user=,db=
> WARNING: page 782 of relation base/3711580/4395394 was uninitialized
> postgres[10736]: [19-1] 2011-07-01 07:50:30 EEST [10736]: [19-1] user=,db=
> WARNING: page 68948 of relation base/3711580/4398155 was uninitialized
> postgres[10736]: [20-1] 2011-07-01 07:50:30 EEST [10736]: [20-1] user=,db=
> WARNING: page 68986 of relation base/3711580/4398155 was uninitialized
> [ there are maybe 50 lines of these WARNING messages for all together six
> different relations, all indexes ]
> postgres[10736]: [73-1] 2011-07-01 07:50:31 EEST [10736]: [73-1] user=,db=
> PANIC: WAL contains references to invalid pages
> postgres[10727]: [2-1] 2011-07-01 07:50:31 EEST [10727]: [2-1] user=,db=
> LOG: startup process (PID 10736) was terminated by signal 6: Aborted
> postgres[10727]: [3-1] 2011-07-01 07:50:31 EEST [10727]: [3-1] user=,db=
> LOG: terminating any other active server processes
> Then I started postgres as it was shut down:
> postgres[12191]: [1-1] 2011-07-01 07:50:59 EEST [12191]: [1-1] user=,db=
> LOG: database system was interrupted while in recovery at log time
> 2011-07-01 07:44:02 EEST
> postgres[12191]: [1-2] 2011-07-01 07:50:59 EEST [12191]: [2-1] user=,db=
> HINT: If this has occurred more than once some data might be corrupted and
> you might need to choose an earlier recovery target.
> postgres[12191]: [2-1] 2011-07-01 07:50:59 EEST [12191]: [3-1] user=,db=
> LOG: database system was not properly shut down; automatic recovery in
> progress
> postgres[12191]: [3-1] 2011-07-01 07:50:59 EEST [12191]: [4-1] user=,db=
> LOG: redo starts at AE/8B13E398
> postgres[12191]: [4-1] 2011-07-01 07:50:59 EEST [12191]: [5-1] user=,db=
> LOG: consistent recovery state reached at AE/8C000000
> postgres[12191]: [5-1] 2011-07-01 07:50:59 EEST [12191]: [6-1] user=,db=
> LOG: unexpected pageaddr AE/6E000000 in log file 174, segment 140, offset 0
> postgres[12191]: [6-1] 2011-07-01 07:50:59 EEST [12191]: [7-1] user=,db=
> LOG: redo done at AE/8B1855C8
> postgres[12191]: [7-1] 2011-07-01 07:50:59 EEST [12191]: [8-1] user=,db=
> LOG: last completed transaction was at log time 2011-07-01
> 07:50:14.67424+03
> postgres[12194]: [1-1] 2011-07-01 07:50:59 EEST [12194]: [1-1] user=,db=
> LOG: autovacuum launcher started
> postgres[12189]: [1-1] 2011-07-01 07:50:59 EEST [12189]: [1-1] user=,db=
> LOG: database system is ready to accept connections
> Are these log messages something to worry about?

http://archives.postgresql.org/pgsql-hackers/2008-12/msg01335.php
According to the above thread, this issue seems to exist for a few years,
and be unsolved yet. Could you provide a self-contained test case?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: PANIC while doing failover (streaming replication)

PANIC while doing failover (streaming replication)

Re: PANIC while doing failover (streaming replication)

Re: PANIC while doing failover (streaming replication)

Re: PANIC while doing failover (streaming replication)

Re: [HACKERS] PANIC while doing failover (streaming replication)