Thread: BUG #7500: hot-standby replica crash after an initial rsync

BUG #7500: hot-standby replica crash after an initial rsync

From:
maxim.boguk@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      7500
Logged by:          Maxim Boguk
Email address:      maxim.boguk@gmail.com
PostgreSQL version: 9.0.8
Operating system:   FreeBSD
Description:


Hi,

Reinitializing the replica after a failover is a procedure I have performed
many times on this installation, and I have never had any issues. But today,
after following the same pg_start_backup/rsync procedure, the replica doesn't
start, failing with the following error:

2012-08-21 11:17:07.777 MSK 34329 @ from  [vxid: txid:0] []LOG:  entering
standby mode
2012-08-21 11:17:07.780 MSK 34329 @ from  [vxid:1/0 txid:0] []LOG:  redo
starts at 925/2C0004C8
2012-08-21 11:17:07.783 MSK 34329 @ from  [vxid:1/0 txid:0] []FATAL:  could
not access status of transaction 361788644
2012-08-21 11:17:07.783 MSK 34329 @ from  [vxid:1/0 txid:0] []DETAIL:  Could
not read from file "pg_multixact/offsets/1590" at offset 114688: Unknown
error: 0.
2012-08-21 11:17:07.783 MSK 34329 @ from  [vxid:1/0 txid:0] []CONTEXT:  xlog
redo create multixact 361788644 offset 822580965: 1768508061 1768508078
2012-08-21 11:17:07.791 MSK 34327 @ from  [vxid: txid:0] []LOG:  startup
process (PID 34329) exited with exit code 1
2012-08-21 11:17:07.791 MSK 34327 @ from  [vxid: txid:0] []LOG:  terminating
any other active server processes


I have kept all the database files for future investigation.

What should I look into first?

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Andres Freund
Date:
On Tuesday, August 21, 2012 10:08:11 AM maxim.boguk@gmail.com wrote:
> [...]
> I have kept all the database files for future investigation.
>
> What should I look into first?
Could you reproduce the error with log_error_verbosity=verbose? Or even better
provide a backtrace with gdb?
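
That is, set the following in postgresql.conf on the replica and start it
again (log_error_verbosity is an ordinary server setting):

log_error_verbosity = verbose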

Greetings,

Andres
--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Maxim Boguk
Date:
> > I have kept all the database files for future investigation.
> >
> > What should I look into first?
> Could you reproduce the error with log_error_verbosity=verbose? Or even
> better provide a backtrace with gdb?

Here is the log with log_error_verbosity=verbose:


2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []LOG:  00000:
database system was interrupted while in recovery at log time 2012-08-21
08:21:42 MSK
2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []HINT:  If this
has occurred more than once some data might be corrupted and you might need
to choose an earlier recovery target.
2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []LOCATION:
StartupXLOG, xlog.c:5757
2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []LOG:  00000:
creating missing WAL directory "pg_xlog/archive_status"
2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []LOCATION:
ValidateXLOGDirectoryStructure, xlog.c:3452
2012-08-21 14:04:19.009 MSK 33950 @ from  [vxid: txid:0] []LOG:  00000:
entering standby mode
2012-08-21 14:04:19.009 MSK 33950 @ from  [vxid: txid:0] []LOCATION:
StartupXLOG, xlog.c:5827
2012-08-21 14:04:19.041 MSK 33951 @ from  [vxid: txid:0] []LOG:  00000:
streaming replication successfully connected to primary
2012-08-21 14:04:19.041 MSK 33951 @ from  [vxid: txid:0] []LOCATION:
libpqrcv_connect, libpqwalreceiver.c:169
2012-08-21 14:04:20.140 MSK 33950 @ from  [vxid:1/0 txid:0] []LOG:  00000:
redo starts at 925/2C0004C8
2012-08-21 14:04:20.140 MSK 33950 @ from  [vxid:1/0 txid:0] []LOCATION:
StartupXLOG, xlog.c:6199
2012-08-21 14:04:20.160 MSK 33950 @ from  [vxid:1/0 txid:0] []FATAL:
XX000: could not access status of transaction 361788644
2012-08-21 14:04:20.161 MSK 33950 @ from  [vxid:1/0 txid:0] []DETAIL:
Could not read from file "pg_multixact/offsets/1590" at offset 114688:
Unknown error: 0.
2012-08-21 14:04:20.161 MSK 33950 @ from  [vxid:1/0 txid:0] []CONTEXT:
xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
2012-08-21 14:04:20.161 MSK 33950 @ from  [vxid:1/0 txid:0] []LOCATION:
SlruReportIOError, slru.c:840
2012-08-21 14:04:20.169 MSK 33948 @ from  [vxid: txid:0] []LOG:  00000:
startup process (PID 33950) exited with exit code 1
2012-08-21 14:04:20.169 MSK 33948 @ from  [vxid: txid:0] []LOCATION:
LogChildExit, postmaster.c:2840
2012-08-21 14:04:20.169 MSK 33948 @ from  [vxid: txid:0] []LOG:  00000:
terminating any other active server processes
2012-08-21 14:04:20.169 MSK 33948 @ from  [vxid: txid:0] []LOCATION:
HandleChildCrash, postmaster.c:2674


Kind Regards,
Maksym

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Maxim Boguk
Date:
On Tue, Aug 21, 2012 at 8:08 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:

> Here is the log with log_error_verbosity=verbose:
> [...]

I played around a bit with more detailed debug output and got a more detailed log:

LOG:  00000: database system was interrupted while in recovery at log time
2012-08-21 08:21:42 MSK
HINT:  If this has occurred more than once some data might be corrupted and
you might need to choose an earlier recovery target.
LOCATION:  StartupXLOG, xlog.c:5757
DEBUG:  00000: standby_mode = 'on'
LOCATION:  readRecoveryCommandFile, xlog.c:5269
DEBUG:  00000: primary_conninfo = 'host=10.12.22.19 port=5433 user=replica
password=acilper'
LOCATION:  readRecoveryCommandFile, xlog.c:5276
DEBUG:  00000: trigger_file = '/db/data/failover'
LOCATION:  readRecoveryCommandFile, xlog.c:5283
LOG:  00000: entering standby mode
LOCATION:  StartupXLOG, xlog.c:5827
DEBUG:  00000: checkpoint record is at 925/2D41CAA0
LOCATION:  StartupXLOG, xlog.c:5894
DEBUG:  00000: redo record is at 925/2C0004C8; shutdown FALSE
LOCATION:  StartupXLOG, xlog.c:5929
DEBUG:  00000: next transaction ID: 0/1768511856; next OID: 59880938
LOCATION:  StartupXLOG, xlog.c:5933
DEBUG:  00000: next MultiXactId: 361788644; next MultiXactOffset: 822580965
LOCATION:  StartupXLOG, xlog.c:5936
DEBUG:  00000: oldest unfrozen transaction ID: 1569182525, in database 16424
LOCATION:  StartupXLOG, xlog.c:5939
DEBUG:  00000: transaction ID wrap limit is 3716666172, limited by database
with OID 16424
LOCATION:  SetTransactionIdLimit, varsup.c:334
DEBUG:  00000: initializing for hot standby
LOCATION:  StartupXLOG, xlog.c:6070
DEBUG:  00000: my backend id is 1
LOCATION:  SharedInvalBackendInit, sinvaladt.c:326
LOG:  00000: redo starts at 925/2C0004C8
LOCATION:  StartupXLOG, xlog.c:6199
DEBUG:  00000: record known xact 1768508078 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508078 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: remove KnownAssignedXid 1768508078
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
LOCATION:  KnownAssignedXidsRemove, procarray.c:2869
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 1768508080 rel 1663/16424/17282;
tid 725829/8
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo update: rel 1663/16424/17282; tid 725829/8; new 726017/7
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508076 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 1768508076 rel 1663/16424/47760635;
tid 3075/136
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/47747; tid 43866/197
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/52546872; tid 44700/276
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508061 latestObservedXid 0
CONTEXT:  xlog redo log: rel 1663/16424/19211
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508082 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 1768508082 rel 1663/16424/4759107;
tid 1098/7
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508076 latestObservedXid 0
CONTEXT:  xlog redo hot_update: rel 1663/16424/47760635; tid 3075/136; new
3075/140
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 1768508080 rel 1663/16424/17282;
tid 725918/2
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508082 latestObservedXid 0
CONTEXT:  xlog redo hot_update: rel 1663/16424/4759107; tid 1098/7; new
1098/39
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo update: rel 1663/16424/17282; tid 725918/2; new 725918/6
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/47747; tid 43866/201
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/52546872; tid 44700/276
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.082508+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.082508+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: remove KnownAssignedXid 1768508080
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.082508+04
LOCATION:  KnownAssignedXidsRemove, procarray.c:2869
DEBUG:  00000: record known xact 1768508061 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/64654; tid 178183/44
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508061 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/95419; tid 38650/318
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508061 latestObservedXid 0


CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061
1768508078
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
FATAL:  XX000: could not access status of transaction 361788644
DETAIL:  Could not read from file "pg_multixact/offsets/1590" at offset
114688: Unknown error: 0.
CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061
1768508078
LOCATION:  SlruReportIOError, slru.c:840


DEBUG:  00000: shmem_exit(1): 5 callbacks to make



I have some problems debugging the startup process with gdb...
I followed this sequence of commands (and got no useful results):

[pgsql@db1 ~/tmp/postgresql-9.0.8]$ /usr/local/bin/gdb
/usr/local/bin/postgres
...
Reading symbols from /usr/local/bin/postgres...done.
(gdb) b exit
Breakpoint 1 at 0x44d38c
(gdb) run -D /db/data_back/
Starting program: /usr/local/bin/postgres -D /db/data_back/

Program received signal SIGUSR1, User defined signal 1.
0x00000008018ddadc in select () from /lib/libc.so.7
(gdb) bt
#0  0x00000008018ddadc in select () from /lib/libc.so.7
#1  0x000000000065641c in ServerLoop () at postmaster.c:1396
#2  0x0000000000655d05 in PostmasterMain (argc=3, argv=0x7fffffffead0) at
postmaster.c:1097
#3  0x00000000005d974a in main (argc=3, argv=0x7fffffffead0) at main.c:188
(gdb) cont
Continuing.

Breakpoint 1, 0x000000080186bab0 in exit () from /lib/libc.so.7
(gdb) bt
#0  0x000000080186bab0 in exit () from /lib/libc.so.7
#1  0x000000000067d0a6 in proc_exit (code=1) at ipc.c:137
#2  0x0000000000659933 in ExitPostmaster (status=1) at postmaster.c:4172
#3  0x0000000000657d2d in reaper (postgres_signal_arg=20) at
postmaster.c:2354
#4  <signal handler called>
#5  <signal handler called>
#6  0x00000008018ddadc in select () from /lib/libc.so.7
#7  0x000000000065641c in ServerLoop () at postmaster.c:1396
#8  0x0000000000655d05 in PostmasterMain (argc=3, argv=0x7fffffffead0) at
postmaster.c:1097
#9  0x00000000005d974a in main (argc=3, argv=0x7fffffffead0) at main.c:188
(gdb) cont
Continuing.
[Inferior 1 (process 98857) exited with code 01]
(gdb) quit

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Andres Freund
Date:
Hi,


On Tuesday, August 21, 2012 02:28:54 PM Maxim Boguk wrote:
> [...]
> FATAL:  XX000: could not access status of transaction 361788644
> DETAIL:  Could not read from file "pg_multixact/offsets/1590" at offset
> 114688: Unknown error: 0.
> CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061
> 1768508078
> LOCATION:  SlruReportIOError, slru.c:840
The strange thing here is the "unknown error: 0" bit. That indicates that
slru_errno isn't properly set...

> I have some problems debugging the startup process with gdb...
> I followed this sequence of commands (and got no useful results):
You're debugging the postmaster that way. The easiest way would be to just
attach to the startup process with gdb -p. Not sure if you can manage that
timing-wise without manually adding sleeps.

Otherwise you probably will have to work with "set follow-fork-mode ask" in
gdb. Getting to the right child isn't easy though.
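
Something along these lines, say (the pid is whatever ps reports for the
startup process, and SlruReportIOError is just a convenient spot to break on,
taken from the LOCATION lines in your log):

$ ps ax | grep 'startup process'        # note the startup process pid
$ gdb /usr/local/bin/postgres -p <startup pid>
(gdb) b SlruReportIOError
(gdb) cont
(gdb) bt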

> [pgsql@db1 ~/tmp/postgresql-9.0.8]$ /usr/local/bin/gdb
> /usr/local/bin/postgres
> ...
> Reading symbols from /usr/local/bin/postgres...done.
> (gdb) b exit
> Breakpoint 1 at 0x44d38c
> (gdb) run -D /db/data_back/
> Starting program: /usr/local/bin/postgres -D /db/data_back/
>
> Program received signal SIGUSR1, User defined signal 1.
> 0x00000008018ddadc in select () from /lib/libc.so.7
> (gdb) bt
> #0  0x00000008018ddadc in select () from /lib/libc.so.7
> #1  0x000000000065641c in ServerLoop () at postmaster.c:1396
> #2  0x0000000000655d05 in PostmasterMain (argc=3, argv=0x7fffffffead0) at
> postmaster.c:1097
> #3  0x00000000005d974a in main (argc=3, argv=0x7fffffffead0) at main.c:188
> (gdb) cont
> Continuing.
handle SIGUSR1 nostop noprint
is useful to avoid this, btw.


Andres
--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Maxim Boguk
Date:
Hi Andres,


> > I have some problems debugging the startup process with gdb...
> > I followed this sequence of commands (and got no useful results):
> You're debugging the postmaster that way. The easiest way would be to just
> attach to the startup process with gdb -p. Not sure if you can manage that
> timing-wise without manually adding sleeps.
>

It seems impossible without an added sleep; it is too fast for me.
Where should I add the sleep call in the PostgreSQL source? (The process
management part of PostgreSQL is completely unknown territory to me.)


>
> Otherwise you probably will have to work with "set follow-fork-mode ask" in
> gdb. Getting to the right child isn't easy though.
>

Unfortunately, gdb under FreeBSD doesn't support set follow-fork-mode.

Kind Regards,
Maksym

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Andres Freund
Date:
On Tuesday, August 21, 2012 03:30:44 PM Maxim Boguk wrote:
> Hi Andres,
>
> > [...]
> It seems impossible without an added sleep; it is too fast for me.
> Where should I add the sleep call in the PostgreSQL source? (The process
> management part of PostgreSQL is completely unknown territory to me.)
I would suggest adding it to StartupXLOG() in xlog.c just before the /* REDO
*/ comment or similar.

I would add something akin to

elog(WARNING, "pid of startup is: %d, sleeping for 10s", getpid());
sleep(10);

Greetings,

Andres
--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Maxim Boguk
Date:
On Wed, Aug 22, 2012 at 6:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:

> On Tuesday, August 21, 2012 03:30:44 PM Maxim Boguk wrote:
> > Hi Andres,
>
> I would add something akin to
>
> elog(WARNING, "pid of startup is: %d, sleeping for 10s", getpid());
> sleep(10);
>

Hi Andres,

Here it is:

gdb backtrace:

#0  0x0000000800bd1ab0 in exit () from /lib/libc.so.7
#1  0x00000000005d77f6 in proc_exit (code=1) at ipc.c:137
#2  0x0000000000697948 in errfinish (dummy=<optimized out>) at elog.c:501
#3  0x0000000000488956 in SlruReportIOError (ctl=<optimized out>,
pageno=<optimized out>, xid=361788644) at slru.c:857
#4  0x000000000048956a in SimpleLruReadPage (ctl=0x95bf80, pageno=176654,
write_ok=1 '\001', xid=361788644) at slru.c:427
#5  0x0000000000489f7f in RecordNewMultiXact (multi=361788644,
offset=822580965, nxids=2, xids=0x800e5302c) at multixact.c:764
#6  0x000000000048a126 in multixact_redo (lsn=..., record=0x800e53000) at
multixact.c:2017
#7  0x0000000000484bea in StartupXLOG () at xlog.c:6277
#8  0x00000000004870a8 in StartupProcessMain () at xlog.c:9335
#9  0x000000000049cab2 in AuxiliaryProcessMain (argc=2,
argv=0x7fffffffe450) at bootstrap.c:412
#10 0x00000000005ba926 in StartChildProcess (type=StartupProcess) at
postmaster.c:4442
#11 0x00000000005bccb2 in PostmasterMain (argc=5, argv=0x7fffffffea38) at
postmaster.c:1093
#12 0x0000000000566f3b in main (argc=5, argv=0x7fffffffea38) at main.c:188

relevant database log:
2012-08-22 07:47:21.330 MSK 18155 @ from  [vxid:1/0 txid:0] []LOG:  00000:
redo starts at 925/2C0004C8
2012-08-22 07:47:21.330 MSK 18155 @ from  [vxid:1/0 txid:0] []LOCATION:
StartupXLOG, xlog.c:6202
2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []FATAL:
XX000: could not access status of transaction 361788644
2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []DETAIL:
Could not read from file "pg_multixact/offsets/1590" at offset 114688:
Unknown error: 0.
2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []CONTEXT:
xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []LOCATION:
SlruReportIOError, slru.c:840
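
As a cross-check, the pageno in the backtrace, the segment file name and the
file offset in the FATAL message are all mutually consistent under the SLRU
arithmetic (a quick standalone sketch, assuming the stock 9.0 constants:
BLCKSZ 8192, 4-byte MultiXactOffset entries, 32 pages per SLRU segment):

#include <stdio.h>

int main(void)
{
    unsigned int multi = 361788644;                 /* MultiXactId from the FATAL line */
    unsigned int entries_per_page = 8192 / 4;       /* 2048 offset entries per 8k page */
    unsigned int pageno = multi / entries_per_page; /* 176654, as in frames #4 and #5 */
    unsigned int segno = pageno / 32;               /* 5520 decimal = 0x1590 */
    unsigned int rpageno = pageno % 32;             /* page 14 within that segment */

    /* prints: pageno=176654 segment=1590 offset=114688 */
    printf("pageno=%u segment=%04X offset=%u\n",
           pageno, segno, rpageno * 8192);
    return 0;
}

So the redo record really is asking for page 14 of segment file "1590", i.e.
file offset 114688, which is exactly where the read fails.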


Kind Regards,
Maksym

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Maxim Boguk
Date:
On Wed, Aug 22, 2012 at 1:50 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
> [...]

And finally, a gdb backtrace on a build without optimization
( ./configure --prefix=/usr/local/pgsql/tmp/postgresql/ --enable-debug
--enable-cassert CFLAGS='-O0' ):

#0  0x0000000800d42ab0 in exit () from /lib/libc.so.7
#1  0x00000000006aafc6 in proc_exit (code=1) at ipc.c:137
#2  0x00000000007bcbb1 in errfinish (dummy=0) at elog.c:501
#3  0x00000000004b7065 in SlruReportIOError (ctl=0xad9740, pageno=176654,
xid=361788644) at slru.c:836
#4  0x00000000004b6353 in SimpleLruReadPage (ctl=0xad9740, pageno=176654,
write_ok=1 '\001', xid=361788644) at slru.c:427
#5  0x00000000004b85dd in RecordNewMultiXact (multi=361788644,
offset=822580965, nxids=2, xids=0x800f5302c) at multixact.c:764
#6  0x00000000004b9dd1 in multixact_redo (lsn=..., record=0x800f53000) at
multixact.c:2017
#7  0x00000000004aef0c in StartupXLOG () at xlog.c:6277
#8  0x00000000004b446d in StartupProcessMain () at xlog.c:9335
#9  0x00000000004d9e36 in AuxiliaryProcessMain (argc=2,
argv=0x7fffffffe8b0) at bootstrap.c:412
#10 0x0000000000683398 in StartChildProcess (type=StartupProcess) at
postmaster.c:4442
#11 0x000000000067ef60 in PostmasterMain (argc=5, argv=0x7fffffffea38) at
postmaster.c:1093
#12 0x00000000005fbf5a in main (argc=5, argv=0x7fffffffea38) at main.c:188

It seems to be failing in the function SlruPhysicalReadPage, at this block:

        errno = 0;
        if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
        {
                slru_errcause = SLRU_READ_FAILED;
                slru_errno = errno;
                close(fd);
                return false;
        }

It seems that it read fewer than BLCKSZ bytes, and that is not an error
condition for the read call (from man 2 read: "Note that reading less than
size bytes is not an error.").

The actual size of the pg_multixact/offsets/1590 file is:
[pgsql@db1 ~/tmp/postgresql-9.0.8]$ ls -la
/db/data_back/pg_multixact/offsets/1590
-rw-------  1 pgsql  pgsql  81920 Aug 21 07:46
/db/data_back/pg_multixact/offsets/1590

Now it seems that the lseek(fd, (off_t) offset, SEEK_SET) call moves the seek
position beyond the actual file size, and after that the read call can only
read 0 bytes from it.
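
To see that failure mode in isolation, a minimal standalone sketch (not
PostgreSQL code; it assumes it is run from the data directory and reuses the
same file and offset as in the error):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char    buf[8192];
    ssize_t n;
    int     fd = open("pg_multixact/offsets/1590", O_RDONLY);

    if (fd < 0)
        return 1;
    errno = 0;
    /* seeking past EOF is legal and succeeds */
    if (lseek(fd, (off_t) 114688, SEEK_SET) < 0)
        return 1;
    /* at or past EOF, read() returns 0 and leaves errno untouched at 0 */
    n = read(fd, buf, sizeof(buf));
    printf("read %zd bytes: %s\n", n, strerror(errno));
    close(fd);
    return 0;
}

On FreeBSD, strerror(0) is rendered as "Unknown error: 0", which is exactly
the text in the DETAIL line.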

Does that mean that the file was damaged during rsync?

What should I examine next?

Kind Regards,
Maksym

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Andres Freund
Date:
Hi Maxim,

On Wednesday, August 22, 2012 06:18:14 AM Maxim Boguk wrote:
> [...]
> And finally, a gdb backtrace on a build without optimization
> ( ./configure --prefix=/usr/local/pgsql/tmp/postgresql/ --enable-debug
> --enable-cassert CFLAGS='-O0' ):
> [...]
Thanks! That already helps.

> It seems to be failing in the function SlruPhysicalReadPage, at this block:
>
>         errno = 0;
>         if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
>         {
>                 slru_errcause = SLRU_READ_FAILED;
>                 slru_errno = errno;
>                 close(fd);
>                 return false;
>         }
>
> It seems that it read fewer than BLCKSZ bytes, and that is not an error
> condition for the read call (from man 2 read: "Note that reading less than
> size bytes is not an error.").
If there are enough bytes available and you're not reading from a pipe or
similar, it shouldn't happen (unless signals intervene; those change errno,
though). But you already cleared up why it's happening:

> The actual size of the pg_multixact/offsets/1590 file is:
> [pgsql@db1 ~/tmp/postgresql-9.0.8]$ ls -la
> /db/data_back/pg_multixact/offsets/1590
> -rw-------  1 pgsql  pgsql  81920 Aug 21 07:46
> /db/data_back/pg_multixact/offsets/1590
>
> Now it seems that the lseek(fd, (off_t) offset, SEEK_SET) call moves the
> seek position beyond the actual file size, and after that the read call can
> only read 0 bytes from it.
That does sound plausible, yes.

> Does that mean that the file was damaged during rsync?
Not necessarily. When did you initially set up that cluster? Normally the file
should get zeroed out before it is used. If the cluster was copied improperly
(i.e. no pg_start/stop backup or such) it could easily happen. But I would
guess that you did it properly?

Could you show pg_controldata output for primary/replica? To see how new that
multixact offset is.

> What should I examine next?
If you could give me a DEBUG3 log from startup till it crashes, it would be
helpful. Please #define MULTIXACT_DEBUG on top of multixact.c before doing so.
That's probably going to be too big to go through the mailing list.

If you're feeling adventurous you could use the xlogdump project
(https://github.com/snaga/xlogdump) to see what xlog records happen from the
time it starts up till it crashes. I don't immediately see a way a
XLOG_MULTIXACT_CREATE_ID can be generated without a preceding
XLOG_MULTIXACT_ZERO_OFF_PAGE zeroing the page.
But most of that should already be visible in the log with the above define.

Did you have any crashes, -m immediate shutdowns, or similar on the primary or
replica?

Greetings,

Andres
--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Maxim Boguk
Date:
> > Does that mean that the file was damaged during rsync?
> Not necessarily. When did you initially set up that cluster? Normally the
> file should get zeroed out before it is used. If the cluster was copied
> improperly (i.e. no pg_start/stop backup or such) it could easily happen.
> But I would guess that you did it properly?

That server was the master before a planned switchover to the replica server
was performed.
Procedure:
1) stop the former master
2) trigger replica failover (so that server becomes the new master)

There was one unusual part, caused by human error on the system
administrator's side:
2.1) start the former master
2.2) stop the former master

3) empty the pg_xlog directory on the former master
4) pg_start_backup on the new master
5) re-rsync the data files to the former master using:
rsync -aq --delete --exclude pg_xlog /db/data [former master]:/db/
6) pg_stop_backup on the new master
7) start the former master as a replica...

That procedure has been used more than 10 times without any problem in the
past. Maybe I'm missing something in it.
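
(For reference, steps 4 and 6 are plain function calls executed on the new
master; a sketch, the backup label is arbitrary:)

psql -c "SELECT pg_start_backup('resync former master')"   # step 4
# ... step 5: rsync the data files as above ...
psql -c "SELECT pg_stop_backup()"                          # step 6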



>
> Could you show pg_controldata output for primary/replica? To see how new
> that multixact offset is.
>

On failing replica:
[pgsql@db1 ~/tmp/postgresql-9.0.8]$ pg_controldata /db/data_back
pg_control version number:            903
Catalog version number:               201008051
Database system identifier:           5687760461964967531
Database cluster state:               in archive recovery
pg_control last modified:             Wed Aug 22 13:47:35 2012
Latest checkpoint location:           925/2D41CAA0
Prior checkpoint location:            925/2D41CAA0
Latest checkpoint's REDO location:    925/2C0004C8
Latest checkpoint's TimeLineID:       5
Latest checkpoint's NextXID:          0/1768511856
Latest checkpoint's NextOID:          59880938
Latest checkpoint's NextMultiXactId:  361788644
Latest checkpoint's NextMultiOffset:  822580965
Latest checkpoint's oldestXID:        1569182525
Latest checkpoint's oldestXID's DB:   16424
Latest checkpoint's oldestActiveXID:  1768498919
Time of latest checkpoint:            Tue Aug 21 08:21:42 2012
Minimum recovery ending location:     925/2C0004C8
Backup start location:                925/2C0004C8
Current wal_level setting:            hot_standby
Current max_connections setting:      300
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   2048
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value

There is no sense looking at it on the master, because the master is one day
ahead of that replica. It's a production system.



>
> > What should I examine next?
> If you could give me a DEBUG3 log from startup till it crashes, it would be
> helpful. Please #define MULTIXACT_DEBUG on top of multixact.c before doing
> so.
> That's probably going to be too big to go through the mailing list.
>

I see nothing new after I rebuilt postgres with that define
( /usr/local/pgsql/tmp/postgresql/bin/postgres -d 3 -D /db/data_back/ ).
Just:

LOCATION:  readRecoveryCommandFile, xlog.c:5269
DEBUG:  00000: primary_conninfo = 'host=10.12.22.19 port=5433 user=replica
password=acilper'
LOCATION:  readRecoveryCommandFile, xlog.c:5276
DEBUG:  00000: trigger_file = '/db/data/failover'
LOCATION:  readRecoveryCommandFile, xlog.c:5283
LOG:  00000: entering standby mode
LOCATION:  StartupXLOG, xlog.c:5827
DEBUG:  00000: checkpoint record is at 925/2D41CAA0
LOCATION:  StartupXLOG, xlog.c:5894
DEBUG:  00000: redo record is at 925/2C0004C8; shutdown FALSE
LOCATION:  StartupXLOG, xlog.c:5929
DEBUG:  00000: next transaction ID: 0/1768511856; next OID: 59880938
LOCATION:  StartupXLOG, xlog.c:5933
DEBUG:  00000: next MultiXactId: 361788644; next MultiXactOffset: 822580965
LOCATION:  StartupXLOG, xlog.c:5936
DEBUG:  00000: oldest unfrozen transaction ID: 1569182525, in database 16424
LOCATION:  StartupXLOG, xlog.c:5939
DEBUG:  00000: MultiXact: setting next multi to 361788644 offset 822580965
LOCATION:  MultiXactSetNextMXact, multixact.c:1668
DEBUG:  00000: transaction ID wrap limit is 3716666172, limited by database
with OID 16424
LOCATION:  SetTransactionIdLimit, varsup.c:334
WARNING:  01000: pid of startup is: 16919, sleeping for 60s
LOCATION:  StartupXLOG, xlog.c:5983
DEBUG:  00000: initializing for hot standby
LOCATION:  StartupXLOG, xlog.c:6073
LOG:  00000: redo starts at 925/2C0004C8
LOCATION:  StartupXLOG, xlog.c:6202
FATAL:  XX000: could not access status of transaction 361788644
DETAIL:  Could not read from file "pg_multixact/offsets/1590" at offset
114688: Unknown error: 0.
CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061
1768508078
LOCATION:  SlruReportIOError, slru.c:840




>
> If you're feeling adventurous you could use the xlogdump project
> (https://github.com/snaga/xlogdump) to see what xlog records happen from
> the time it starts up till it crashes. I don't immediately see a way a
> XLOG_MULTIXACT_CREATE_ID can be generated without a preceding
> XLOG_MULTIXACT_ZERO_OFF_PAGE zeroing the page.
> But most of that should already be visible in the log with the above
> define.
>
> Did you have any crashes, -m immediate shutdowns, or similar on the
> primary or
> replica?
>

I don't think so. But I'm not sure, because the task was performed by the
system administrator following the provided instructions, and I was informed
only after the replica failed to start.



Kind Regards,
Maksym

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Andres Freund
Date:
On Wednesday, August 22, 2012 12:08:25 PM Maxim Boguk wrote:
> > [...]
> That server was the master before a planned switchover to the replica
> server was performed.
But it was never started up before, do I understand that correctly?

Greetings,

Andres
--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Stuart Bishop
Date:
I believe I just hit this same issue, but with PG 9.1.3:

<@:32407> 2012-08-29 10:02:09 UTC LOG:  shutting down
<@:32407> 2012-08-29 10:02:09 UTC LOG:  database system is shut down
<[unknown]@[unknown]:31687> 2012-08-29 13:34:03 UTC LOG:  connection
received: host=[local]
<[unknown]@[unknown]:31687> 2012-08-29 13:34:03 UTC LOG:  incomplete
startup packet
<@:31686> 2012-08-29 13:34:03 UTC LOG:  database system was
interrupted; last known up at 2012-08-29 13:14:47 UTC
<@:31686> 2012-08-29 13:34:03 UTC LOG:  entering standby mode
<@:31686> 2012-08-29 13:34:03 UTC LOG:  redo starts at A92/5F000020
<@:31686> 2012-08-29 13:34:03 UTC FATAL:  could not access status of
transaction 208177034
<@:31686> 2012-08-29 13:34:03 UTC DETAIL:  Could not read from file
"pg_multixact/offsets/0C68" at offset 131072: Success.
<@:31686> 2012-08-29 13:34:03 UTC CONTEXT:  xlog redo create multixact
208177034 offset 1028958730: 1593544329 1593544330
<@:31681> 2012-08-29 13:34:03 UTC LOG:  startup process (PID 31686)
exited with exit code 1
<@:31681> 2012-08-29 13:34:03 UTC LOG:  terminating any other active
server processes

This was attempting to rebuild a hot standby after switching my master
to a new server. In between the shutdown and the attempt to restart:

 - The master was put into backup mode.
 - The datadir was rsynced over, using rsync -ahhP --delete-before
--exclude=postmaster.pid --exclude=pg_xlog
 - The master was taken out of backup mode.
 - The pg_xlog directory was emptied
 - The pg_xlog directory was rsynced across from the master. This
included all the WAL files from before the promotion, throughout
backup mode, and a few from after backup mode ended.
 - recovery.conf was created (hot_standby=on, primary_conninfo=...,
recovery_target_timeline=latest)
 - The hot standby was started.

Unfortunately I didn't have the disk space around to keep everything,
and am attempting the rebuild again now.

--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Andres Freund
Date:
On Wednesday, August 29, 2012 05:32:31 PM Stuart Bishop wrote:
> I believe I just hit this same issue, but with PG 9.1.3:
> [...]
> This was attempting to rebuild a hot standby after switching my master
> to a new server. In between the shutdown and the attempt to restart:
>
>  - The master was put into backup mode.
>  - The datadir was rsynced over, using rsync -ahhP --delete-before
> --exclude=postmaster.pid --exclude=pg_xlog
>  - The master was taken out of backup mode.
>  - The pg_xlog directory was emptied
>  - The pg_xlog directory was rsynced across from the master. This
> included all the WAL files from before the promotion, throughout
> backup mode, and a few from after backup mode ended.
That's not valid: you cannot easily guarantee that you've not copied files
that were in the process of being written to. Use a restore_command if you do
not want all files to be transferred via the replication connection, but do
that only for files that have been archived via an archive_command beforehand.
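
I.e. something along these lines on the standby (a sketch only; host, user
and archive path are placeholders):

# recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=master.example.com port=5432 user=replica'
# fetch only segments that archive_command on the primary has already
# archived; do not rsync live pg_xlog contents:
restore_command = 'cp /path/to/wal_archive/%f "%p"'
recovery_target_timeline = 'latest'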

Did you have a backup label in the rsync'ed datadir? In Maxim's case I could
detect that he had not via line numbers, but I do not see them here...

Greetings,

Andres
--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7500: hot-standby replica crash after an initial rsync

From:
Stuart Bishop
Date:
On Wed, Aug 29, 2012 at 10:59 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On Wednesday, August 29, 2012 05:32:31 PM Stuart Bishop wrote:
>> [...]
> That's not valid: you cannot easily guarantee that you've not copied files
> that were in the process of being written to. Use a restore_command if you
> do not want all files to be transferred via the replication connection, but
> do that only for files that have been archived via an archive_command
> beforehand.

Ok. I had assumed this was fine, as the docs explicitly tell me to
copy across any unarchived WAL files when doing failover. I think my
confusion is because the docs for building a standby refer to the
section on recovering from a backup, but I have a live server.

I'll just let the WAL files get sucked over the replication connection
if that works - this seems much simpler.  I don't think I saw this
mentioned in the docs. I had been assuming enough WAL needed to be
available to bring the DB up to a consistent state before streaming
replication would start.


> Did you have a backup label in the rsync'ed datadir? In Maxim's case I could
> detect that he had not via line numbers, but I do not see them here...

Yes, the backup_label was copied across (confirmed in scrollback from the rsync).


--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/