Re: BUG #13822: Slave terminated - WAL contains references to invalid page

Поиск
Список
Период
Сортировка
От
Тема Re: BUG #13822: Slave terminated - WAL contains references to invalid page
Дата
Msg-id 1451506840099.98963@tieto.com
обсуждение исходный текст
Ответ на Re: BUG #13822: Slave terminated - WAL contains references to invalid page  (Andres Freund <andres@anarazel.de>)
Ответы Re: BUG #13822: Slave terminated - WAL contains references to invalid page  (Michael Paquier <michael.paquier@gmail.com>)
Список pgsql-bugs
Hi,=0A=
=0A=
I created new replica (original one crashed 2015-12-15 13:05:39 CET) via: =
=0A=
=0A=
$ pg_basebackup -h <master_IP> -D $PGDATA -R -P -U replication --xlog-metho=
d=3Dstream=0A=
$ cat $PGDATA/recovery.conf =0A=
standby_mode =3D 'on'=0A=
primary_conninfo =3D 'user=3Dreplication password=3D*** host=3D<master_IP> =
port=3D5432 sslmode=3Dprefer sslcompression=3D1 krbsrvname=3Dpostgres'=0A=
# service postgresql-9.4 start=0A=
=0A=
This new replica crashed 2015-12-22 00:25:11 CET.=0A=
=0A=
Ordinary base backups are done by:=0A=
=0A=
/usr/pgsql-9.4/bin/pg_basebackup -h <master_IP> -D <local_pg_basebackup_dir=
> -Ft -z -v -Xf -U postgres=0A=
=0A=
=0A=
Running Master pg_controldata:=0A=
=0A=
pg_control version number:            942=0A=
Catalog version number:               201409291=0A=
Database system identifier:           6225208935473175762=0A=
Database cluster state:               in production=0A=
pg_control last modified:             Wed 30 Dec 2015 08:00:02 PM CET=0A=
Latest checkpoint location:           191/2E08DC90=0A=
Prior checkpoint location:            191/2E0829E0=0A=
Latest checkpoint's REDO location:    191/2E08DC58=0A=
Latest checkpoint's REDO WAL file:    00000001000001910000002E=0A=
Latest checkpoint's TimeLineID:       1=0A=
Latest checkpoint's PrevTimeLineID:   1=0A=
Latest checkpoint's full_page_writes: on=0A=
Latest checkpoint's NextXID:          0/370154188=0A=
Latest checkpoint's NextOID:          4803404=0A=
Latest checkpoint's NextMultiXactId:  161565=0A=
Latest checkpoint's NextMultiOffset:  410484=0A=
Latest checkpoint's oldestXID:        187316348=0A=
Latest checkpoint's oldestXID's DB:   16422=0A=
Latest checkpoint's oldestActiveXID:  370154188=0A=
Latest checkpoint's oldestMultiXid:   1=0A=
Latest checkpoint's oldestMulti's DB: 16422=0A=
Time of latest checkpoint:            Wed 30 Dec 2015 08:00:01 PM CET=0A=
Fake LSN counter for unlogged rels:   0/1=0A=
Minimum recovery ending location:     0/0=0A=
Min recovery ending loc's timeline:   0=0A=
Backup start location:                0/0=0A=
Backup end location:                  0/0=0A=
End-of-backup record required:        no=0A=
Current wal_level setting:            hot_standby=0A=
Current wal_log_hints setting:        off=0A=
Current max_connections setting:      600=0A=
Current max_worker_processes setting: 8=0A=
Current max_prepared_xacts setting:   0=0A=
Current max_locks_per_xact setting:   64=0A=
Maximum data alignment:               8=0A=
Database block size:                  8192=0A=
Blocks per segment of large relation: 131072=0A=
WAL block size:                       8192=0A=
Bytes per WAL segment:                16777216=0A=
Maximum length of identifiers:        64=0A=
Maximum columns in an index:          32=0A=
Maximum size of a TOAST chunk:        1996=0A=
Size of a large-object chunk:         2048=0A=
Date/time type storage:               64-bit integers=0A=
Float4 argument passing:              by value=0A=
Float8 argument passing:              by value=0A=
Data page checksum version:           0=0A=
=0A=
=0A=
Running Slave pg_controldata:=0A=
=0A=
pg_control version number:            942=0A=
Catalog version number:               201409291=0A=
Database system identifier:           6225208935473175762=0A=
Database cluster state:               in archive recovery=0A=
pg_control last modified:             Wed 30 Dec 2015 08:00:40 PM CET=0A=
Latest checkpoint location:           191/2E08DC90=0A=
Prior checkpoint location:            191/2E0829E0=0A=
Latest checkpoint's REDO location:    191/2E08DC58=0A=
Latest checkpoint's REDO WAL file:    00000001000001910000002E=0A=
Latest checkpoint's TimeLineID:       1=0A=
Latest checkpoint's PrevTimeLineID:   1=0A=
Latest checkpoint's full_page_writes: on=0A=
Latest checkpoint's NextXID:          0/370154188=0A=
Latest checkpoint's NextOID:          4803404=0A=
Latest checkpoint's NextMultiXactId:  161565=0A=
Latest checkpoint's NextMultiOffset:  410484=0A=
Latest checkpoint's oldestXID:        187316348=0A=
Latest checkpoint's oldestXID's DB:   16422=0A=
Latest checkpoint's oldestActiveXID:  370154188=0A=
Latest checkpoint's oldestMultiXid:   1=0A=
Latest checkpoint's oldestMulti's DB: 16422=0A=
Time of latest checkpoint:            Wed 30 Dec 2015 08:00:01 PM CET=0A=
Fake LSN counter for unlogged rels:   0/1=0A=
Minimum recovery ending location:     191/2E093C68=0A=
Min recovery ending loc's timeline:   1=0A=
Backup start location:                0/0=0A=
Backup end location:                  0/0=0A=
End-of-backup record required:        no=0A=
Current wal_level setting:            hot_standby=0A=
Current wal_log_hints setting:        off=0A=
Current max_connections setting:      600=0A=
Current max_worker_processes setting: 8=0A=
Current max_prepared_xacts setting:   0=0A=
Current max_locks_per_xact setting:   64=0A=
Maximum data alignment:               8=0A=
Database block size:                  8192=0A=
Blocks per segment of large relation: 131072=0A=
WAL block size:                       8192=0A=
Bytes per WAL segment:                16777216=0A=
Maximum length of identifiers:        64=0A=
Maximum columns in an index:          32=0A=
Maximum size of a TOAST chunk:        1996=0A=
Size of a large-object chunk:         2048=0A=
Date/time type storage:               64-bit integers=0A=
Float4 argument passing:              by value=0A=
Float8 argument passing:              by value=0A=
Data page checksum version:           0=0A=
=0A=
=0A=
Slave pg_controldata after second PITR attempt:=0A=
=0A=
pg_control version number:            942=0A=
Catalog version number:               201409291=0A=
Database system identifier:           6225208935473175762=0A=
Database cluster state:               shut down in recovery=0A=
pg_control last modified:             Wed 30 Dec 2015 08:53:51 PM CET=0A=
Latest checkpoint location:           187/9DA56658=0A=
Prior checkpoint location:            186/D7000060=0A=
Latest checkpoint's REDO location:    187/9D647E30=0A=
Latest checkpoint's REDO WAL file:    00000001000001870000009D=0A=
Latest checkpoint's TimeLineID:       1=0A=
Latest checkpoint's PrevTimeLineID:   1=0A=
Latest checkpoint's full_page_writes: on=0A=
Latest checkpoint's NextXID:          0/367008737=0A=
Latest checkpoint's NextOID:          4721484=0A=
Latest checkpoint's NextMultiXactId:  160851=0A=
Latest checkpoint's NextMultiOffset:  409011=0A=
Latest checkpoint's oldestXID:        187316348=0A=
Latest checkpoint's oldestXID's DB:   16422=0A=
Latest checkpoint's oldestActiveXID:  367008737=0A=
Latest checkpoint's oldestMultiXid:   1=0A=
Latest checkpoint's oldestMulti's DB: 16422=0A=
Time of latest checkpoint:            Tue 22 Dec 2015 12:21:40 AM CET=0A=
Fake LSN counter for unlogged rels:   0/1=0A=
Minimum recovery ending location:     187/9FECB9A0=0A=
Min recovery ending loc's timeline:   1=0A=
Backup start location:                0/0=0A=
Backup end location:                  0/0=0A=
End-of-backup record required:        no=0A=
Current wal_level setting:            hot_standby=0A=
Current wal_log_hints setting:        off=0A=
Current max_connections setting:      600=0A=
Current max_worker_processes setting: 8=0A=
Current max_prepared_xacts setting:   0=0A=
Current max_locks_per_xact setting:   64=0A=
Maximum data alignment:               8=0A=
Database block size:                  8192=0A=
Blocks per segment of large relation: 131072=0A=
WAL block size:                       8192=0A=
Bytes per WAL segment:                16777216=0A=
Maximum length of identifiers:        64=0A=
Maximum columns in an index:          32=0A=
Maximum size of a TOAST chunk:        1996=0A=
Size of a large-object chunk:         2048=0A=
Date/time type storage:               64-bit integers=0A=
Float4 argument passing:              by value=0A=
Float8 argument passing:              by value=0A=
Data page checksum version:           0=0A=
=0A=
=0A=
# cat backup_label.old =0A=
START WAL LOCATION: 186/D7000028 (file 0000000100000186000000D7)=0A=
CHECKPOINT LOCATION: 186/D7000060=0A=
BACKUP METHOD: streamed=0A=
BACKUP FROM: master=0A=
START TIME: 2015-12-21 19:00:03 CET=0A=
LABEL: pg_basebackup base backup=0A=
=0A=
=0A=
Slave's pg_controldata above after PITR has been created after second recov=
ery attempt.=0A=
To be honest I don't understand why this second PITR finished without PANIC=
 when previous attempt from the same base backup and archived wals failed.=
=0A=
See logs details below:=0A=
=0A=
First (crashed) PITR:=0A=
=0A=
2015-12-30 08:31:19 CET @  LOG:  database system was interrupted; last know=
n up at 2015-12-21 20:15:02 CET=0A=
2015-12-30 08:31:19 CET @  LOG:  starting point-in-time recovery to 2015-12=
-22 00:30:00+01=0A=
2015-12-30 08:31:19 CET @  LOG:  database system was not properly shut down=
; automatic recovery in progress=0A=
2015-12-30 08:31:19 CET @  LOG:  redo starts at 186/D7260540=0A=
2015-12-30 08:31:19 CET @  LOG:  consistent recovery state reached at 186/D=
8000000=0A=
2015-12-30 08:31:19 CET @  LOG:  database system is ready to accept read on=
ly connections=0A=
2015-12-30 08:31:19 CET @  LOG:  restored log file "0000000100000186000000D=
8" from archive=0A=
2015-12-30 08:31:20 CET @  LOG:  restored log file "0000000100000186000000D=
9" from archive=0A=
/.../=0A=
2015-12-30 08:32:48 CET @  LOG:  restored log file "00000001000001870000009=
D" from archive=0A=
2015-12-30 08:32:50 CET @  LOG:  restored log file "00000001000001870000009=
E" from archive=0A=
2015-12-30 08:32:51 CET @  WARNING:  page 73172 of relation base/16422/2325=
3 is uninitialized=0A=
2015-12-30 08:32:51 CET @  CONTEXT:  xlog redo visible: rel 1663/16422/2325=
3; blk 73172=0A=
2015-12-30 08:32:51 CET @  PANIC:  WAL contains references to invalid pages=
=0A=
2015-12-30 08:32:51 CET @  CONTEXT:  xlog redo visible: rel 1663/16422/2325=
3; blk 73172=0A=
2015-12-30 08:32:51 CET @  LOG:  startup process (PID 29324) was terminated=
 by signal 6: Aborted=0A=
2015-12-30 08:32:51 CET @  LOG:  terminating any other active server proces=
ses=0A=
=0A=
For this one unfortunatelly I don't have pg_controldata.=0A=
=0A=
Second one:=0A=
=0A=
2015-12-30 20:42:39 CET @  LOG:  database system was interrupted; last know=
n up at 2015-12-21 20:15:02 CET=0A=
2015-12-30 20:42:39 CET @  LOG:  starting point-in-time recovery to 2015-12=
-22 00:30:00+01=0A=
2015-12-30 20:42:39 CET @  LOG:  restored log file "0000000100000186000000D=
7" from archive=0A=
2015-12-30 20:42:39 CET @  LOG:  redo starts at 186/D7000028=0A=
2015-12-30 20:42:39 CET @  LOG:  consistent recovery state reached at 186/D=
7265F80=0A=
2015-12-30 20:42:39 CET @  LOG:  database system is ready to accept read on=
ly connections=0A=
2015-12-30 20:42:40 CET @  LOG:  restored log file "0000000100000186000000D=
8" from archive=0A=
2015-12-30 20:42:40 CET @  LOG:  restored log file "0000000100000186000000D=
9" from archive=0A=
/.../=0A=
2015-12-30 20:44:03 CET @  LOG:  restored log file "00000001000001870000009=
D" from archive=0A=
2015-12-30 20:44:05 CET @  LOG:  restored log file "00000001000001870000009=
E" from archive=0A=
2015-12-30 20:44:08 CET @  LOG:  restored log file "00000001000001870000009=
F" from archive=0A=
2015-12-30 20:44:10 CET @  LOG:  recovery stopping before commit of transac=
tion 367008747, time 2015-12-22 00:30:01.876776+01=0A=
2015-12-30 20:44:10 CET @  LOG:  recovery has paused=0A=
2015-12-30 20:44:10 CET @  HINT:  Execute pg_xlog_replay_resume() to contin=
ue.=0A=
=0A=
Regards=0A=
Marek=0A=
________________________________________=0A=
From: Andres Freund <andres@anarazel.de>=0A=
Sent: Wednesday, December 30, 2015 12:49 PM=0A=
To: Petr Marek=0A=
Cc: michael.paquier@gmail.com; pgsql-bugs@postgresql.org=0A=
Subject: Re: [BUGS] BUG #13822: Slave terminated - WAL contains references =
to invalid page=0A=
=0A=
Hi,=0A=
=0A=
On 2015-12-29 15:51:32 +0000, Marek.Petr@tieto.com wrote:=0A=
> I used fresh base backup for slave after both crashes.=0A=
> Also tried to scan archived wals several hours before last crash and foun=
d only the following for string 71566:=0A=
>=0A=
> rmgr: Heap2       len (rec/tot):     20/    52, tx:          0, lsn: 187/=
987859D8, prev 187/987859A0, bkp: 0000, desc: visible: rel 1663/16422/17216=
; blk 71566=0A=
> rmgr: Heap2       len (rec/tot):     20/    52, tx:          0, lsn: 187/=
9CC59020, prev 187/9CC58FE8, bkp: 0000, desc: visible: rel 1663/16422/17220=
; blk 71566=0A=
> rmgr: Heap2       len (rec/tot):     20/    52, tx:          0, lsn: 187/=
9E356D98, prev 187/9E356D60, bkp: 0000, desc: visible: rel 1663/16422/23253=
; blk 71566=0A=
=0A=
Could you detail how you take base backups and how you set them up as=0A=
new replicas? Master's pg_controldata, the crashed standby's and the=0A=
standby's backup_label.old would be helpful.=0A=
=0A=
Greetings,=0A=
=0A=
Andres Freund=

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Ali Raza
Дата:
Сообщение: Re: BUG #13839: command line restart server fail
Следующее
От: kou@clear-code.com
Дата:
Сообщение: BUG #13840: pg_dump generates unloadable SQL when third party string type index option is used