[BUGS] Standby corruption after master is restarted

Поиск
Список
Период
Сортировка
От Emre Hasegeli
Тема [BUGS] Standby corruption after master is restarted
Дата
Msg-id CAE2gYzzVZNsGn=-E6grO4sVQs04J02zNKQofQEO8gu8=qCFR+Q@mail.gmail.com
обсуждение исходный текст
Ответы Re: [BUGS] Standby corruption after master is restarted  (Andres Freund <andres@anarazel.de>)
Re: Standby corruption after master is restarted  (Emre Hasegeli <emre@hasegeli.com>)
Список pgsql-bugs
One of our standby servers on a testing environment at InnoGames broke
today.  The master and the other standby server is working fine.  I
noticed that it happened right after the master had restarted.  It
looks to me like the WAL file is corrupted.  I wanted to check, if it
can be a bug.  These were on the log file:

> 2017-08-24 12:19:51 UTC [20351-2] LOG:  replication terminated by primary server
> 2017-08-24 12:19:51 UTC [20351-3] DETAIL:  End of WAL reached on timeline 1 at 0/9B9F5FF0.
> 2017-08-24 12:19:51 UTC [20351-4] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> 2017-08-24 12:19:51 UTC [26673-7] FATAL:  invalid memory alloc request size 1298694144
> 2017-08-24 12:19:51 UTC [26672-3] LOG:  startup process (PID 26673) exited with exit code 1
> 2017-08-24 12:19:51 UTC [26672-4] LOG:  terminating any other active server processes

I tried to start it again to get more logs:

> 2017-08-24 14:07:46 UTC [28861-12698] LOCATION:  KnownAssignedXidsRemovePreceding, procarray.c:3738
> 2017-08-24 14:07:46 UTC [28861-12699] DEBUG:  00000: transaction ID wrap limit is 2147484807, limited by database
withOID 1 
> 2017-08-24 14:07:46 UTC [28861-12700] CONTEXT:  xlog redo at 0/9B9F5F80 for XLOG/CHECKPOINT_SHUTDOWN: redo
0/9B9F5F80;tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest
multi1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown 
> 2017-08-24 14:07:46 UTC [28861-12701] LOCATION:  SetTransactionIdLimit, varsup.c:347
> 2017-08-24 14:07:46 UTC [28861-12702] DEBUG:  00000: prune KnownAssignedXids to 5869245
> 2017-08-24 14:07:46 UTC [28861-12703] CONTEXT:  xlog redo at 0/9B9F5F80 for XLOG/CHECKPOINT_SHUTDOWN: redo
0/9B9F5F80;tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest
multi1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown 
> 2017-08-24 14:07:46 UTC [28861-12704] LOCATION:  KnownAssignedXidsRemovePreceding, procarray.c:3738
> 2017-08-24 14:07:46 UTC [28861-12705] FATAL:  XX000: invalid memory alloc request size 1298694144
> 2017-08-24 14:07:46 UTC [28861-12706] LOCATION:  palloc_extended, mcxt.c:961

These from pg_controldata:

> pg_control version number:            960
> Catalog version number:               201608131
> Database system identifier:           6423674571818176107
> Database cluster state:               in archive recovery
> pg_control last modified:             Thu 24 Aug 2017 02:01:42 PM UTC
> Latest checkpoint location:           0/9B9C1188
> Prior checkpoint location:            0/9B9C1188
> Latest checkpoint's REDO location:    0/9B9BEC60
> Latest checkpoint's REDO WAL file:    00000001000000000000009B
> Latest checkpoint's TimeLineID:       1
> Latest checkpoint's PrevTimeLineID:   1
> Latest checkpoint's full_page_writes: on
> Latest checkpoint's NextXID:          0:5868410
> Latest checkpoint's NextOID:          28526
> Latest checkpoint's NextMultiXactId:  1
> Latest checkpoint's NextMultiOffset:  0
> Latest checkpoint's oldestXID:        1160
> Latest checkpoint's oldestXID's DB:   1
> Latest checkpoint's oldestActiveXID:  5868410
> Latest checkpoint's oldestMultiXid:   1
> Latest checkpoint's oldestMulti's DB: 1
> Latest checkpoint's oldestCommitTsXid:0
> Latest checkpoint's newestCommitTsXid:0
> Time of latest checkpoint:            Thu 24 Aug 2017 12:02:56 PM UTC
> Fake LSN counter for unlogged rels:   0/1
> Minimum recovery ending location:     0/9B9E13C0
> Min recovery ending loc's timeline:   1
> Backup start location:                0/0
> Backup end location:                  0/0
> End-of-backup record required:        no
> wal_level setting:                    replica
> wal_log_hints setting:                on
> max_connections setting:              400
> max_worker_processes setting:         8
> max_prepared_xacts setting:           0
> max_locks_per_xact setting:           64
> track_commit_timestamp setting:       off
> Maximum data alignment:               8
> Database block size:                  8192
> Blocks per segment of large relation: 131072
> WAL block size:                       8192
> Bytes per WAL segment:                16777216
> Maximum length of identifiers:        64
> Maximum columns in an index:          32
> Maximum size of a TOAST chunk:        1996
> Size of a large-object chunk:         2048
> Date/time type storage:               64-bit integers
> Float4 argument passing:              by value
> Float8 argument passing:              by value
> Data page checksum version:           1

I tried pg_xlogdump and it failed on the spot:

> rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 0/9B9F5F80, prev 0/9B9F5F48, desc:
CHECKPOINT_SHUTDOWNredo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest
xid1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown 
> pg_xlogdump: FATAL:  error in WAL record at 0/9B9F5F80: unexpected pageaddr 0/979F6000 in log segment
00000000000000000000009B,offset 10444800 

This is from the master at the same place:

> rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 0/9B9F5F80, prev 0/9B9F5F48, desc:
CHECKPOINT_SHUTDOWNredo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest
xid1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown 
> rmgr: XLOG        len (rec/tot):     49/  2805, tx:          0, lsn: 0/9B9F5FF0, prev 0/9B9F5F80, desc: FPI_FOR_HINT
,blkref #0: rel 1663/16396/16398 blk 0 FPW 
> rmgr: Heap        len (rec/tot):     54/  5046, tx:    5869245, lsn: 0/9B9F6B00, prev 0/9B9F5FF0, desc: INSERT off
54,blkref #0: rel 1663/16396/16427 blk 64203 FPW 
> rmgr: Btree       len (rec/tot):     53/  7961, tx:    5869245, lsn: 0/9B9F7EB8, prev 0/9B9F6B00, desc: INSERT_LEAF
off281, blkref #0: rel 1663/16396/16446 blk 21754 FPW 
> rmgr: Transaction len (rec/tot):     34/    34, tx:    5869245, lsn: 0/9B9F9DF0, prev 0/9B9F7EB8, desc: COMMIT
2017-08-2412:19:59.560138 UTC 

I still have the database available.


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Masahiko Sawada
Дата:
Сообщение: Re: [BUGS] BUG #14788: `pg_restore -c` won't restore schema access privileges.
Следующее
От: Masahiko Sawada
Дата:
Сообщение: Re: [BUGS] BUG #14788: `pg_restore -c` won't restore schema access privileges.