Failing start-up archive recovery at Standby mode in PG9.2.4

Поиск
Список
Период
Сортировка
От KONDO Mitsumasa
Тема Failing start-up archive recovery at Standby mode in PG9.2.4
Дата
Msg-id 51779B3B.1020003@lab.ntt.co.jp
обсуждение исходный текст
Ответы Re: Failing start-up archive recovery at Standby mode in PG9.2.4  (Andres Freund <andres@2ndquadrant.com>)
Список pgsql-hackers
Hi,

I find problem about failing start-up achive recovery at Standby mode in PG9.2.4 streaming replication.
I test same problem in PG9.2.3. But it is not occerd...

We search this problem patch in PostgreSQL git repository.
It may be occerd in this commit.
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ee8b95e8c807f31220f31bcff445d1dcb20c28b4

Attached sciript(replication_test.sh) is reproduced this problem.
If you use this script, you sholud set up about PGPATH and install pgbench.
Script execution summary is under following:

 1. Create Master and Standby instance.
 2. Insert Master to lot of record using pgbench and stop master by immediate shutdown.
 3. Start Master and Standby.
 4. Standby fails start-up achive recovery in ValidXLOGHeader, xlog.c:4395.
 5. Standby thinks achive recovery is finished. It is mistake!
 6. Standby connects to Master for stream replication, but it fail.

> cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: restored log file "000000020000000000000013" from archive
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  RestoreArchivedFile, xlog.c:3273
> [Standby] 2013-04-22 01:27:25 EDTDEBUG:  00000: got WAL segment from archive
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  XLogFileReadAnyTLI, xlog.c:2930
> [Standby] 2013-04-22 01:27:25 EDTDEBUG:  00000: compacted fsync request queue from 4096 entries to 2 entries
> [Standby] 2013-04-22 01:27:25 EDTCONTEXT:  xlog redo multi-insert (init): rel 1663/12870/24588; blk 41575; 61 tuples
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  CompactCheckpointerRequestQueue, checkpointer.c:1288
> [Standby] 2013-04-22 01:27:25 EDTDEBUG:  00000: checkpoint sync: number=1 file=base/12870/24588 time=822.092 msec
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  mdsync, md.c:1161
> [Standby] 2013-04-22 01:27:25 EDTDEBUG:  00000: checkpoint sync: number=2 file=base/12870/24588_fsm time=0.908 msec
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  mdsync, md.c:1161
> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: restartpoint complete: wrote 6 buffers (0.1%); 0 transaction log
file(s)added, 3 removed, 4 recycled; write=0.437 s, sync=0.823 s, total=1.274 s; sync files=2, longest=0.822 s,
average=0.411s 
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  LogCheckpointEnd, xlog.c:7893
> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: recovery restart point at 0/F0FF8D0
> [Standby] 2013-04-22 01:27:25 EDTDETAIL:  last completed transaction was at log time 2013-04-22 01:26:32.439575-04
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  CreateRestartPoint, xlog.c:8601
> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: restartpoint starting: xlog
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  LogCheckpointStart, xlog.c:7821
> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
[* This point is illegal work]> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: unexpected pageaddr 0/6000000 in log file
0,segment 20, offset 0 
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> [Master] 2013-04-22 01:27:25 EDTDEBUG:  00000: received replication command: IDENTIFY_SYSTEM
> [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:25 EDTDEBUG:  00000: received replication command: START_REPLICATION 0/14000000
> [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:25 EDTFATAL:  58P01: requested WAL segment 000000030000000000000014 has already been
removed
> [Master] 2013-04-22 01:27:25 EDTLOCATION:  XLogRead, walsender.c:1023
> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: streaming replication successfully connected to primary
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_connect, libpqwalreceiver.c:171
> [Standby] 2013-04-22 01:27:25 EDTFATAL:  XX000: could not receive data from WAL stream: FATAL:  requested WAL segment
000000030000000000000014has already been removed 
>
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_receive, libpqwalreceiver.c:389
>
> [mitsu-ko@localhost postgresql-9.2.4]$ bincp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-22 01:27:30 EDTDEBUG:  00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> [Standby] 2013-04-22 01:27:30 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> [Master] 2013-04-22 01:27:30 EDTDEBUG:  00000: received replication command: IDENTIFY_SYSTEM
> [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:30 EDTDEBUG:  00000: received replication command: START_REPLICATION 0/14000000
> [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:30 EDTFATAL:  58P01: requested WAL segment 000000030000000000000014 has already been
removed
> [Master] 2013-04-22 01:27:30 EDTLOCATION:  XLogRead, walsender.c:1023
> [Standby] 2013-04-22 01:27:30 EDTLOG:  00000: streaming replication successfully connected to primary
> [Standby] 2013-04-22 01:27:30 EDTLOCATION:  libpqrcv_connect, libpqwalreceiver.c:171
> [Standby] 2013-04-22 01:27:30 EDTFATAL:  XX000: could not receive data from WAL stream: FATAL:  requested WAL segment
000000030000000000000014has already been removed 
>
> [Standby] 2013-04-22 01:27:30 EDTLOCATION:  libpqrcv_receive, libpqwalreceiver.c:389
> /pg_ctl -D Standby cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-22 01:27:35 EDTDEBUG:  00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> [Standby] 2013-04-22 01:27:35 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> [Master] 2013-04-22 01:27:35 EDTDEBUG:  00000: received replication command: IDENTIFY_SYSTEM
> [Master] 2013-04-22 01:27:35 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:35 EDTDEBUG:  00000: received replication command: START_REPLICATION 0/14000000
> [Master] 2013-04-22 01:27:35 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:35 EDTFATAL:  58P01: requested WAL segment 000000030000000000000014 has already been
removed

We try to fix this problem with my company's colleaguea,
but it is too difficult and complicated to WAL contorol and validation method...

This problem is occerd in xlog.c:10692.
> 10692         if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
> 10693                 goto next_record_is_invalid;

And problem is occerd this checking in ValidXLOGHeader().
>  4390         if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
>  4391         {
>  4392                 ereport(emode_for_corrupt_record(emode, recaddr),
>  4393                                 (errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
>  4394                                                 hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
>  4395                                                 readId, readSeg, readOff)));
>  4396                 return false;
>  4397         }

We think hdr->xlp_pageaddr.xrecoff has wrong value. This is very rare cace!
We cannot analyze why this value is uncorrect!


Please fix this problem with us!

Best regards

--
NTT OSS Center
Mitsumasa KONDO

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Heikki Linnakangas
Дата:
Сообщение: Re: Fast promotion, loose ends
Следующее
От: Simon Riggs
Дата:
Сообщение: Re: Fast promotion, loose ends