Hot strandby fails to restart after pg_subtrans corruption (?)

Поиск
Список
Период
Сортировка
От Paul Guyot
Тема Hot strandby fails to restart after pg_subtrans corruption (?)
Дата
Msg-id FCCF0B65-62EA-4E4C-9E99-10C47FC339A4@ieee.org
обсуждение исходный текст
Список pgsql-general
Hello,

After a hardware reboot, a hot standby server fails to start.
The log mentions pg_subtrans files. While the machine got a hard reboot, underlying filesystem is ZFS which did not
detector report any corruption. 

For the first restart after the reboot, the server (then 9.1.0) complained that a file did not exist.
May 10 14:16:49 citrinus postgres[1914]: [6-1] LOG:  file "pg_subtrans/B7E5" doesn't exist, reading as zeroes

Considering this may be related to this thread :
http://archives.postgresql.org/pgsql-general/2011-12/msg00291.php

postgres was cleanly restarted during the recovery :
May 10 14:22:30 citrinus postgres[1489]: [1-1] LOG:  received fast shutdown request
May 10 14:22:30 citrinus postgres[1916]: [1-1] LOG:  shutting down
May 10 14:22:31 citrinus postgres[1916]: [2-1] LOG:  database system is shut down

and then updated to 9.1.3.

Now, the server just fails later on the same file.

----
May 10 15:48:05 citrinus postgres[39203]: [1-1] LOG:  database system was interrupted while in recovery at log time
2012-04-2507:53:33 UTC 
May 10 15:48:05 citrinus postgres[39203]: [1-2] HINT:  If this has occurred more than once some data might be corrupted
andyou might need to choose an earlier recovery target. 
May 10 15:48:05 citrinus postgres[39203]: [2-1] LOG:  entering standby mode
May 10 15:48:07 citrinus postgres[39203]: [3-1] LOG:  restored log file "00000001000006D600000017" from archive
May 10 15:48:10 citrinus postgres[39203]: [4-1] LOG:  restored log file "00000001000006D500000068" from archive
May 10 15:48:10 citrinus postgres[39203]: [5-1] LOG:  redo starts at 6D5/684B13E8
May 10 15:49:47 citrinus postgres[39203]: [6-1] LOG:  restored log file "00000001000006D500000069" from archive
May 10 15:50:37 citrinus postgres[39203]: [7-1] LOG:  restored log file "00000001000006D50000006A" from archive
May 10 15:51:19 citrinus postgres[39203]: [8-1] LOG:  restored log file "00000001000006D50000006B" from archive
May 10 15:52:05 citrinus postgres[39203]: [9-1] LOG:  restored log file "00000001000006D50000006C" from archive
May 10 15:53:06 citrinus postgres[39203]: [10-1] LOG:  restored log file "00000001000006D50000006D" from archive
May 10 15:53:24 citrinus postgres[39203]: [11-1] FATAL:  could not access status of transaction 3085299721
May 10 15:53:24 citrinus postgres[39203]: [11-2] DETAIL:  Could not read from file "pg_subtrans/B7E5" at offset 245760:
Noerror: 0. 
May 10 15:53:24 citrinus postgres[39203]: [11-3] CONTEXT:  xlog redo xid assignment xtop 3085293107: subxacts:
30852994493085299451 3085299462 3085299465 3085299469 3085299473 3085299478 3085299485 3085299487 3085299489 3085299491
30852994943085299497 3085299498 3085299543 3085299545 3085299547 3085299549 3085299551 3085299553 3085299555 3085299557
30852995583085299559 3085299560 3085299561 3085299562 3085299563 3085299564 3085299567 3085299568 3085299569 3085299570
30852995713085299573 3085299574 3085299576 3085299578 3085299580 3085299582 3085299594 3085299596 3085299607 3085299608
30852996113085299635 3085299646 3085299648 3085299650 3085299652 3085299654 3085299656 3085299658 3085299660 3085299663
30852996653085299667 3085299669 3085299671 3085299673 3085299681 3085299683 3085299687 3085299721 
May 10 15:53:24 citrinus postgres[39202]: [1-1] LOG:  startup process (PID 39203) exited with exit code 1
May 10 15:53:24 citrinus postgres[39202]: [2-1] LOG:  terminating any other active server processes
----

File B7E5 has exactly 245760 bytes.

If I remove the B7E5 file as well as file B7F6 which was created during recovery, and retry, it fails later on like
this:

----
May 10 16:15:53 citrinus postgres[41142]: [75-1] LOG:  file "pg_subtrans/B7E6" doesn't exist, reading as zeroes
May 10 16:15:53 citrinus postgres[41142]: [75-2] CONTEXT:  xlog redo xid assignment xtop 3085346439: subxacts:
30853643073085364352 3085364355 3085364358 3085364363 3085364404 3085364407 3085364409 3085364496 3085364518 3085364532
30853645683085364571 3085364611 3085364625 3085364660 3085364663 3085364668 3085364671 3085364715 3085364719 3085364727
30853647383085364740 3085364768 3085364770 3085364773 3085364792 3085364800 3085364803 3085364806 3085364817 3085364819
30853648293085364830 3085364832 3085364854 3085364874 3085364877 3085364885 3085364886 3085364889 3085364890 3085364891
30853649113085364914 3085364918 3085364920 3085364964 3085364988 3085364998 3085365001 3085365013 3085365014 3085365015
30853650163085365068 3085365069 3085365071 3085365106 3085365183 3085365284 3085365287 3085365288 
May 10 16:16:10 citrinus postgres[41142]: [76-1] LOG:  restored log file "00000001000006D500000077" from archive
May 10 16:16:13 citrinus postgres[41142]: [77-1] FATAL:  could not access status of transaction 3085367299
May 10 16:16:13 citrinus postgres[41142]: [77-2] DETAIL:  Could not read from file "pg_subtrans/B7E6" at offset 253952:
Noerror: 0. 
May 10 16:16:13 citrinus postgres[41142]: [77-3] CONTEXT:  xlog redo xid assignment xtop 3085346439: subxacts:
30853668203085366821 3085366822 3085366823 3085366824 3085366825 3085366826 3085366827 3085366828 3085366829 3085366830
30853668313085366832 3085366833 3085366834 3085366835 3085366836 3085366837 3085366838 3085366860 3085366905 3085366919
30853669203085366949 3085366965 3085367040 3085367057 3085367059 3085367062 3085367076 3085367168 3085367170 3085367174
30853671773085367180 3085367182 3085367183 3085367184 3085367185 3085367202 3085367205 3085367208 3085367212 3085367215
30853672643085367266 3085367269 3085367272 3085367274 3085367277 3085367299 3085367301 3085367305 3085367309 3085367311
30853673313085367334 3085367337 3085367338 3085367339 3085367341 3085367343 3085367345 3085367351 
May 10 16:16:13 citrinus postgres[41141]: [1-1] LOG:  startup process (PID 41142) exited with exit code 1
May 10 16:16:13 citrinus postgres[41141]: [2-1] LOG:  terminating any other active server processes
----

At this point, file B7E6 was created and is 8KB long.
-rw-------  1 pgsql  wheel    8192 May 10 16:16 B7E6

What are the steps to recover the standby?

Paul
--
Semiocast            http://semiocast.com/
Semioboard          http://semioboard.com/
+33.183627948 - 20 rue Lacaze, 75014 Paris


Вложения

В списке pgsql-general по дате отправления:

Предыдущее
От: Horaci Macias
Дата:
Сообщение: Re: vacuum, vacuum full and problems releasing disk space
Следующее
От: Evan Carroll
Дата:
Сообщение: Custom Domain; migration from 8.4 to 9.1 and COLLATE