Re: Completely broken replica after PANIC: WAL contains references to invalid pages

Поиск
Список
Период
Сортировка
От Sergey Konoplev
Тема Re: Completely broken replica after PANIC: WAL contains references to invalid pages
Дата
Msg-id CAL_0b1vLSGx=ENLkSK1cyDtGcYVAb_im051w-bza3H-7dPgTBA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Completely broken replica after PANIC: WAL contains references to invalid pages  ("anarazel@anarazel.de" <andres@anarazel.de>)
Ответы Re: Completely broken replica after PANIC: WAL contains references to invalid pages  (Andres Freund <andres@2ndquadrant.com>)
Список pgsql-bugs
On Fri, Mar 29, 2013 at 1:52 PM, anarazel@anarazel.de
<andres@anarazel.de> wrote:
> I think I See whats going on. Do you still have the datadir available? If=
 so, could you send the pg_controldata output?

I have already rebuilt the replica, however below is the output if it is us=
eful:

pg_control version number:            922
Catalog version number:               201204301
Database system identifier:           5858109675396804534
Database cluster state:               in archive recovery
pg_control last modified:             =D0=A1=D0=B1=D1=82 30 =D0=9C=D0=B0=D1=
=80 2013 00:21:11
Latest checkpoint location:           258B/BDBBE748
Prior checkpoint location:            258B/86DABCB8
Latest checkpoint's REDO location:    258B/8B78BED0
Latest checkpoint's TimeLineID:       2
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/423386899
Latest checkpoint's NextOID:          2758636912
Latest checkpoint's NextMultiXactId:  103920
Latest checkpoint's NextMultiOffset:  431309
Latest checkpoint's oldestXID:        225634745
Latest checkpoint's oldestXID's DB:   16436
Latest checkpoint's oldestActiveXID:  421766298
Time of latest checkpoint:            =D0=9F=D1=82=D0=BD 29 =D0=9C=D0=B0=D1=
=80 2013 22:33:01
Minimum recovery ending location:     258C/14AA5FA0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            hot_standby
Current max_connections setting:      550
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value

>
>
> Andres
>
>>Hardware:
>>
>>IBM System x3650 M4, 148GB RAM, NAS
>>
>>Software:
>>
>>PostgreSQL 9.2.3, yum.postgresql.org
>>CentOS 6.3, kernel 2.6.32-279.22.1.el6.x86_64
>>
>>Configuration:
>>
>>listen_addresses =3D '*'
>>max_connections =3D 550
>>shared_buffers =3D 35GB
>>work_mem =3D 256MB
>>maintenance_work_mem =3D 1GB
>>bgwriter_delay =3D 10ms
>>bgwriter_lru_multiplier =3D 10.0
>>effective_io_concurrency =3D 32
>>wal_level =3D hot_standby
>>synchronous_commit =3D off
>>checkpoint_segments =3D 1024
>>checkpoint_timeout =3D 1h
>>checkpoint_completion_target =3D 0.9
>>checkpoint_warning =3D 5min
>>max_wal_senders =3D 3
>>wal_keep_segments =3D 2048
>>hot_standby =3D on
>>max_standby_streaming_delay =3D 5min
>>hot_standby_feedback =3D on
>>effective_cache_size =3D 133GB
>>log_directory =3D '/var/log/pgsql'
>>log_filename =3D 'postgresql-%Y-%m-%d.log'
>>log_checkpoints =3D on
>>log_line_prefix =3D '%t %p %u@%d from %h [vxid:%v txid:%x] [%i] '
>>log_lock_waits =3D on
>>log_statement =3D 'ddl'
>>log_timezone =3D 'W-SU'
>>track_activity_query_size =3D 4096
>>autovacuum_max_workers =3D 5
>>autovacuum_naptime =3D 5s
>>autovacuum_vacuum_scale_factor =3D 0.05
>>autovacuum_analyze_scale_factor =3D 0.05
>>autovacuum_vacuum_cost_delay =3D 5ms
>>datestyle =3D 'iso, dmy'
>>timezone =3D 'W-SU'
>>lc_messages =3D 'en_US.UTF-8'
>>lc_monetary =3D 'ru_RU.UTF-8'
>>lc_numeric =3D 'ru_RU.UTF-8'
>>lc_time =3D 'ru_RU.UTF-8'
>>default_text_search_config =3D 'pg_catalog.russian'
>>
>>System:
>>
>># Controls the maximum shared segment size, in bytes
>>kernel.shmmax =3D 53287555072
>>
>># Controls the maximum number of shared memory segments, in pages
>>kernel.shmall =3D 13009657
>>
>># Maximum number of file-handles
>>fs.file-max =3D 65535
>>
>># pdflush tuning to prevent lag spikes
>>vm.dirty_ratio =3D 10
>>vm.dirty_background_ratio =3D 1
>>vm.dirty_expire_centisecs =3D 499
>>
>># Prevent the scheduler breakdown
>>kernel.sched_migration_cost =3D 5000000
>>
>># Turned off to provide more CPU to PostgreSQL
>>kernel.sched_autogroup_enabled =3D 0
>>
>># Setup hugepages
>>vm.hugetlb_shm_group =3D 26
>>vm.hugepages_treat_as_movable =3D 0
>>vm.nr_overcommit_hugepages =3D 512
>>
>># The Huge Page Size is 2048kB, so for 35GB shared buffers the number
>>is 17920
>>vm.nr_hugepages =3D 17920
>>
>># Turn off the NUMA local pages reclaim as it leads to wrong caching
>>strategy for databases
>>vm.zone_reclaim_mode =3D 0
>>
>>Environment:
>>
>>HUGETLB_SHM=3Dyes
>>LD_PRELOAD=3D'/usr/lib64/libhugetlbfs.so'
>>export HUGETLB_SHM LD_PRELOAD
>>
>>When it is stopped:
>>
>>2013-03-26 11:50:32 MSK 3775 @ from  [vxid: txid:0] [] LOG:
>>restartpoint complete: wrote 1685004 buffers (36.7%); 0 transaction
>>log file(s) added, 0 removed, 555 recycled; write=3D3237.402 s,
>>sync=3D0.071 s, total=3D3237.507 s; sync files=3D2673, longest=3D0.008 s,
>>average=3D0.000 s
>>2013-03-26 11:50:32 MSK 3775 @ from  [vxid: txid:0] [] LOG:  recovery
>>restart point at 2538/6E154AC0
>>2013-03-26 11:50:32 MSK 3775 @ from  [vxid: txid:0] [] DETAIL:  last
>>completed transaction was at log time 2013-03-26 11:50:31.613948+04
>>2013-03-26 11:50:32 MSK 3775 @ from  [vxid: txid:0] [] LOG:
>>restartpoint starting: xlog
>>2013-03-26 11:51:16 MSK 3773 @ from  [vxid:1/0 txid:0] [] WARNING:
>>page 451 of relation base/16436/2686702648 is uninitialized
>>2013-03-26 11:51:16 MSK 3773 @ from  [vxid:1/0 txid:0] [] CONTEXT:
>>xlog redo vacuum: rel 1663/16436/2686702648; blk 2485,
>>lastBlockVacuumed 0
>>2013-03-26 11:51:16 MSK 3773 @ from  [vxid:1/0 txid:0] [] PANIC:  WAL
>>contains references to invalid pages
>>2013-03-26 11:51:16 MSK 3773 @ from  [vxid:1/0 txid:0] [] CONTEXT:
>>xlog redo vacuum: rel 1663/16436/2686702648; blk 2485,
>>lastBlockVacuumed 0
>>2013-03-26 11:51:16 MSK 3770 @ from  [vxid: txid:0] [] LOG:  startup
>>process (PID 3773) was terminated by signal 6: Aborted
>>2013-03-26 11:51:16 MSK 3770 @ from  [vxid: txid:0] [] LOG:
>>terminating any other active server processes
>>
> >From /var/log/messages:
>>
>>Mar 26 10:50:52 tms2 kernel: : postmaster: page allocation failure.
>>order:8, mode:0xd0
>>Mar 26 10:50:52 tms2 kernel: : Pid: 3774, comm: postmaster Not tainted
>>2.6.32-279.22.1.el6.x86_64 #1
>>Mar 26 10:50:52 tms2 kernel: : Call Trace:
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff8112343f>] ?
>>__alloc_pages_nodemask+0x77f/0x940
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115d3e2>] ?
>>kmem_getpages+0x62/0x170
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115dffa>] ?
>>fallback_alloc+0x1ba/0x270
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115da4f>] ?
>>cache_grow+0x2cf/0x320
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115dd79>] ?
>>____cache_alloc_node+0x99/0x160
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff813fd455>] ?
>>dma_pin_iovec_pages+0xb5/0x230
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115eb49>] ?
>>__kmalloc+0x189/0x220
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff813fd455>] ?
>>dma_pin_iovec_pages+0xb5/0x230
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff8141a47c>] ?
>>lock_sock_nested+0xac/0xc0
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff8146edaa>] ?
>>tcp_recvmsg+0x4ca/0xe80
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffffa029a67e>] ?
>>xfs_vm_write_end+0x2e/0x60 [xfs]
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffffa0292f39>] ?
>>xfs_trans_unlocked_item+0x39/0x60 [xfs]
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff8148f90a>] ?
>>inet_recvmsg+0x5a/0x90
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff81418b93>] ?
>>sock_recvmsg+0x133/0x160
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffffa029f542>] ?
>>xfs_rw_iunlock+0x32/0x40 [xfs]
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff81090be0>] ?
>>autoremove_wake_function+0x0/0x40
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff81173fff>] ?
>>__dentry_open+0x23f/0x360
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff811764ca>] ?
>>do_sync_write+0xfa/0x140
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff811875d0>] ?
>>do_filp_open+0x780/0xd60
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff81418d0e>] ?
>>sys_recvfrom+0xee/0x180
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff81183105>] ? putname+0x35/0x50
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff81176842>] ?
>>vfs_write+0x132/0x1a0
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff810d3f47>] ?
>>audit_syscall_entry+0x1d7/0x200
>>Mar 26 10:50:52 tms2 kernel: : [<ffffffff8100b072>] ?
>>system_call_fastpath+0x16/0x1b
>>
>>After restart:
>>
>>2013-03-27 10:43:01 MSK 1535 @ from  [vxid: txid:0] [] LOG:  database
>>system was interrupted while in recovery at log time 2013-03-26
>>10:08:37 MSK
>>2013-03-27 10:43:01 MSK 1535 @ from  [vxid: txid:0] [] HINT:  If this
>>has occurred more than once some data might be corrupted and you might
>>need to choose an earlier recovery target.
>>2013-03-27 10:43:01 MSK 1535 @ from  [vxid: txid:0] [] LOG:  entering
>>standby mode
>>2013-03-27 10:43:01 MSK 1535 @ from  [vxid:1/0 txid:0] [] LOG:  redo
>>starts at 2538/6E154AC0
>>2013-03-27 10:43:02 MSK 1535 @ from  [vxid:1/0 txid:0] [] LOG:  file
>>"pg_subtrans/10B7" doesn't exist, reading as zeroes
>>2013-03-27 10:43:02 MSK 1535 @ from  [vxid:1/0 txid:0] [] CONTEXT:
>>xlog redo xid assignment xtop 280482186: subxacts: 280483328 280483333
>>280483343 280483354 280483356 280483368 280483377 280483382 280483392
>>280483404 280483416 280483429 280483440 280483451 280483460 280483472
>>280483487 280483500 280483516 280483530 280483541 280483555 280483565
>>280483574 280483585 280483595 280483604 280483607 280483617 280483626
>>280483636 280483646 280483656 280483665 280483677 280483688 280483699
>>280483709 280483719 280483730 280483739 280483749 280483759 280483761
>>280483771 280483782 280483799 280483800 280483811 280483821 280483824
>>280483836 280483847 280483859 280483871 280483874 280483883 280483897
>>280483906 280483915 280483925 280483937 280483948 280483958
>>2013-03-27 10:43:02 MSK 1535 @ from  [vxid:1/0 txid:0] [] LOG:  file
>>"pg_subtrans/10B7" doesn't exist, reading as zeroes
>>2013-03-27 10:43:02 MSK 1535 @ from  [vxid:1/0 txid:0] [] CONTEXT:
>>xlog redo xid assignment xtop 280482270: subxacts: 280485056 280485070
>>280485083 280485086 280485098 280485113 280485132 280485144 280485156
>>280485167 280485178 280485188 280485201 280485217 280485234 280485249
>>280485267 280485293 280485309 280485327 280485333 280485345 280485353
>>280485373 280485388 280485405 280485420 280485434 280485457 280485476
>>280485482 280485507 280485516 280485531 280485537 280485550 280485565
>>280485568 280485585 280485587 280485601 280485613 280485634 280485639
>>280485656 280485669 280485684 280485690 280485693 280485712 280485730
>>280485754 280485757 280485779 280485801 280485808 280485811 280485830
>>280485856 280485880 280485900 280485920 280485941 280485946
>>
>>[ skipped several more messages of this kind]
>>
>>2013-03-27 10:43:03 MSK 1535 @ from  [vxid:1/0 txid:0] [] LOG:  file
>>"pg_subtrans/10B8" doesn't exist, reading as zeroes
>>2013-03-27 10:43:03 MSK 1535 @ from  [vxid:1/0 txid:0] [] CONTEXT:
>>xlog redo xid assignment xtop 280549936: subxacts: 280555514 280555515
>>280555516 280555517 280555518 280555519 280555520 280555521 280555522
>>280555523 280555524 280555525 280555526 280555527 280555528 280555529
>>280555530 280555531 280555532 280555533 280555534 280555535 280555536
>>280555537 280555538 280555539 280555540 280555541 280555542 280555543
>>280555544 280555545 280555546 280555547 280555548 280555549 280555550
>>280555551 280555552 280555553 280555554 280555555 280555556 280555557
>>280555558 280555559 280555560 280555561 280555562 280555563 280555564
>>280555565 280555566 280555567 280555568 280555569 280555570 280555571
>>280555572 280555573 280555574 280555575 280555576 280555577
>>2013-03-27 10:43:03 MSK 1535 @ from  [vxid:1/0 txid:0] [] FATAL:
>>could not access status of transaction 280557568
>>2013-03-27 10:43:03 MSK 1535 @ from  [vxid:1/0 txid:0] [] DETAIL:
>>Could not read from file "pg_subtrans/10B8" at offset 253952: Success.
>>2013-03-27 10:43:03 MSK 1535 @ from  [vxid:1/0 txid:0] [] CONTEXT:
>>xlog redo xid assignment xtop 280555981: subxacts: 280557520 280557521
>>280557522 280557523 280557524 280557525 280557526 280557527 280557528
>>280557529 280557530 280557531 280557532 280557533 280557534 280557535
>>280557536 280557537 280557538 280557539 280557540 280557541 280557542
>>280557543 280557544 280557545 280557546 280557547 280557548 280557549
>>280557550 280557551 280557552 280557553 280557554 280557555 280557556
>>280557557 280557558 280557559 280557560 280557561 280557562 280557563
>>280557564 280557565 280557566 280557567 280557568 280557569 280557570
>>280557571 280557572 280557573 280557574 280557575 280557576 280557577
>>280557578 280557579 280557580 280557581 280557582 280557583
>>2013-03-27 10:43:03 MSK 1532 @ from  [vxid: txid:0] [] LOG:  startup
>>process (PID 1535) exited with exit code 1
>>2013-03-27 10:43:03 MSK 1532 @ from  [vxid: txid:0] [] LOG:
>>terminating any other active server processes
>>
>>Thank you in advance.
>>
>>--
>>Kind regards,
>>Sergey Konoplev
>>Database and Software Consultant
>>
>>Profile: http://www.linkedin.com/in/grayhemp
>>Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988)
>>888-1979
>>Skype: gray-hemp
>>Jabber: gray.ru@gmail.com
>
>
> ---
> Please excuse brevity and formatting - I am writing this on my mobile pho=
ne.
> ---
> Please excuse brevity and formatting - I am writing this on my mobile pho=
ne.



--
Kind regards,
Sergey Konoplev
Database and Software Consultant

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gray.ru@gmail.com

В списке pgsql-bugs по дате отправления:

Предыдущее
От: "anarazel@anarazel.de"
Дата:
Сообщение: Re: Completely broken replica after PANIC: WAL contains references to invalid pages
Следующее
От: Andres Freund
Дата:
Сообщение: Re: Completely broken replica after PANIC: WAL contains references to invalid pages