Re: Failing start-up archive recovery at Standby mode in PG9.2.4

Поиск
Список
Период
Сортировка
От KONDO Mitsumasa
Тема Re: Failing start-up archive recovery at Standby mode in PG9.2.4
Дата
Msg-id 517899C4.8060701@lab.ntt.co.jp
обсуждение исходный текст
Ответ на Re: Failing start-up archive recovery at Standby mode in PG9.2.4  (Andres Freund <andres@2ndquadrant.com>)
Список pgsql-hackers
Hi,

>> I find problem about failing start-up achive recovery at Standby mode in PG9.2.4 streaming replication.
>> I test same problem in PG9.2.3. But it is not occerd...
>
>>> cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
>>> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: restored log file "000000020000000000000013" from archive
>
> I can't read the error message here, but this looks suspicious. The
> recovery command seems to be returning success although it prints such
> an error message?
Sorry.. This suspicious massage is written in japanese.
It means "missing file or directory". Beacause starting point of timeline 3 WAL file is 00000003000000000000002A.
>>> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
>> [* This point is illegal work]> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: unexpected pageaddr 0/6000000 in log
file0, segment 20, offset 0 
>>> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
>>> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
>>> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
>>> [Master] 2013-04-22 01:27:25 EDTDEBUG:  00000: received replication command: IDENTIFY_SYSTEM
>>> [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
>>> [Master] 2013-04-22 01:27:25 EDTDEBUG:  00000: received replication command: START_REPLICATION 0/14000000
>>> [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
>>> [Master] 2013-04-22 01:27:25 EDTFATAL:  58P01: requested WAL segment 000000030000000000000014 has already been
removed
>>> [Master] 2013-04-22 01:27:25 EDTLOCATION:  XLogRead, walsender.c:1023
>>> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: streaming replication successfully connected to primary
>>> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_connect, libpqwalreceiver.c:171
>>> [Standby] 2013-04-22 01:27:25 EDTFATAL:  XX000: could not receive data from WAL stream: FATAL:  requested WAL
segment000000030000000000000014 has already been removed 
>>>
>>> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_receive, libpqwalreceiver.c:389
>>>
>>> [mitsu-ko@localhost postgresql-9.2.4]$ bincp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
>>> [Standby] 2013-04-22 01:27:30 EDTDEBUG:  00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
>>> [Standby] 2013-04-22 01:27:30 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
>>> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
>>> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
>>> [Master] 2013-04-22 01:27:30 EDTDEBUG:  00000: received replication command: IDENTIFY_SYSTEM
>>> [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
>>> [Master] 2013-04-22 01:27:30 EDTDEBUG:  00000: received replication command: START_REPLICATION 0/14000000
>>> [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
>>> [Master] 2013-04-22 01:27:30 EDTFATAL:  58P01: requested WAL segment 000000030000000000000014 has already been
removed
>>> [Master] 2013-04-22 01:27:30 EDTLOCATION:  XLogRead, walsender.c:1023
>>> [Standby] 2013-04-22 01:27:30 EDTLOG:  00000: streaming replication successfully connected to primary
>>> [Standby] 2013-04-22 01:27:30 EDTLOCATION:  libpqrcv_connect, libpqwalreceiver.c:171
>>> [Standby] 2013-04-22 01:27:30 EDTFATAL:  XX000: could not receive data from WAL stream: FATAL:  requested WAL
segment000000030000000000000014 has already been removed 
>
> And this seems to be the second problem. You probably need to configure
> a higher wal_keep_segments on the primary or you need to fix your
> recovery_command.
In this situation, I try to up timelineID by archive recovery, not also recover WAL files.
If achive recovery works correctly in Standby, recovery done point is 000000020000000000000029 in timelineID 2.
Please see detail in pg9.2.4.log. Situation of archive file is here.
>> We try to fix this problem with my company's colleaguea,
>> but it is too difficult and complicated to WAL contorol and validation method...
>>
>> This problem is occerd in xlog.c:10692.
>>> 10692         if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
>>> 10693                 goto next_record_is_invalid;
>>
>> And problem is occerd this checking in ValidXLOGHeader().
>>>   4390         if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
>>>   4391         {
>>>   4392                 ereport(emode_for_corrupt_record(emode, recaddr),
>>>   4393                                 (errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
>>>   4394                                                 hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
>>>   4395                                                 readId, readSeg, readOff)));
>>>   4396                 return false;
>>>   4397         }
>>
>> We think hdr->xlp_pageaddr.xrecoff has wrong value. This is very rare cace!
>> We cannot analyze why this value is uncorrect!
>
> I think the tests is just noticing that we don't have valid WAL - we are
> probably reading old preallocated wal files from before starting the
> node as a standby. Normally it will get the correct data from the
> primary after that (you can see the START_REPLICATION commands in the
> log), but that failed.
I hope so! I contenue to try to analayze this problem..

Best regards

--
NTT Open Sorce Software Center
Mitsumasa KONDO



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: danger of stats_temp_directory = /dev/shm
Следующее
От: Jeff Davis
Дата:
Сообщение: Re: Enabling Checksums