Thread: Assertion failure at standby promotion

Assertion failure at standby promotion

From:
Fujii Masao
Date:
Hi,

I got the following assertion failure when I promoted the standby.

2013-05-04 00:12:31 JST sby1 LOG:  received promote request
2013-05-04 00:12:31 JST sby1 FATAL:  terminating walreceiver process
due to administrator command
2013-05-04 00:12:31 JST sby1 LOG:  redo done at 0/6FFE038
2013-05-04 00:12:31 JST sby1 LOG:  last completed transaction was at
log time 2013-05-04 00:12:25.878909+09
2013-05-04 00:12:31 JST sby1 LOG:  selected new timeline ID: 2
2013-05-04 00:12:31 JST sby1 LOG:  archive recovery complete
2013-05-04 00:12:31 JST sby1 LOG:  checkpoint starting:
TRAP: FailedAssertion("!(sentPtr <= sendTimeLineValidUpto)", File:
"walsender.c", Line: 1465)
2013-05-04 00:12:31 JST sby1 LOG:  autovacuum launcher started

The way to reproduce this is:

1. Create one master A, one standby B, and one cascade standby C.
2. Run pgbench -i -s 10
3. Promote the standby B before pgbench -i finishes

Regards,

-- 
Fujii Masao



Re: Assertion failure at standby promotion

From:
Amit Langote
Date:
Hello,

I tried reproducing the scenario. Note that I did not archive xlogs
(that is, archive_command = '/bin/true' and a corresponding
restore_command = '/bin/false'). I performed the steps you mentioned
and found the following:

***** Log on standby-1:
[Standby-1]LOG:  database system was interrupted; last known up at
2013-05-05 14:05:08 IST
[Standby-1]LOG:  creating missing WAL directory "pg_xlog/archive_status"
[Standby-1]LOG:  entering standby mode
[Standby-1]LOG:  started streaming WAL from primary at 0/2000000 on timeline 1
[Standby-1]LOG:  redo starts at 0/2000024
[Standby-1]LOG:  consistent recovery state reached at 0/20000DC
[Standby-1]LOG:  database system is ready to accept read only connections
[Standby-1]LOG:  received promote request
[Standby-1]FATAL:  terminating walreceiver process due to administrator command
[Standby-1]LOG:  invalid magic number 0000 in log segment
000000010000000000000003, offset 5316608
[Standby-1]LOG:  redo done at 0/350F0B8
[Standby-1]LOG:  last completed transaction was at log time 2013-05-05
14:05:14.571492+05:30
[Standby-1]LOG:  selected new timeline ID: 2
[Standby-1]LOG:  archive recovery complete
>> [Standby-1]ERROR:  server switched off timeline 1 at 0/3510B14, but walsender already streamed up to 0/3512000
[Standby-1]LOG:  database system is ready to accept connections
[Standby-1]LOG:  autovacuum launcher started


****** Log on Standby-2:
[Standby-2]LOG:  database system was interrupted while in recovery at
log time 2013-05-05 14:05:07 IST
[Standby-2]HINT:  If this has occurred more than once some data might
be corrupted and you might need to choose an earlier recovery target.
[Standby-2]LOG:  creating missing WAL directory "pg_xlog/archive_status"
[Standby-2]LOG:  entering standby mode
[Standby-2]LOG:  started streaming WAL from primary at 0/2000000 on timeline 1
[Standby-2]LOG:  redo starts at 0/2000024
[Standby-2]LOG:  consistent recovery state reached at 0/3000000
[Standby-2]LOG:  database system is ready to accept read only connections
>> [Standby-2]FATAL:  could not receive data from WAL stream: ERROR:  server switched off timeline 1 at 0/3510B14, but walsender already streamed up to 0/3512000

[Standby-2]LOG:  invalid magic number 0000 in log segment
000000010000000000000003, offset 5316608
[Standby-2]LOG:  fetching timeline history file for timeline 2 from
primary server
[Standby-2]LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
[Standby-2]LOG:  replication terminated by primary server
[Standby-2]DETAIL:  End of WAL reached on timeline 1 at 0/3510B14
[Standby-2]LOG:  restarted WAL streaming at 0/3000000 on timeline 1
[Standby-2]LOG:  replication terminated by primary server
[Standby-2]DETAIL:  End of WAL reached on timeline 1 at 0/3510B14
[Standby-2]LOG:  restarted WAL streaming at 0/3000000 on timeline 1
[Standby-2]LOG:  replication terminated by primary server
[Standby-2]DETAIL:  End of WAL reached on timeline 1 at 0/3510B14
[Standby-2]LOG:  restarted WAL streaming at 0/3000000 on timeline 1
[Standby-2]LOG:  replication terminated by primary server
[Standby-2]DETAIL:  End of WAL reached on timeline 1 at 0/3510B14
[Standby-2]LOG:  restarted WAL streaming at 0/3000000 on timeline 1
[Standby-2]LOG:  replication terminated by primary server
[Standby-2]DETAIL:  End of WAL reached on timeline 1 at 0/3510B14
[Standby-2]LOG:  restarted WAL streaming at 0/3000000 on timeline 1
[Standby-2]LOG:  replication terminated by primary server
[Standby-2]DETAIL:  End of WAL reached on timeline 1 at 0/3510B14
[Standby-2]LOG:  restarted WAL streaming at 0/3000000 on timeline 1
[Standby-2]LOG:  replication terminated by primary server
[Standby-2]DETAIL:  End of WAL reached on timeline 1 at 0/3510B14
[Standby-2]LOG:  restarted WAL streaming at 0/3000000 on timeline 1
[Standby-2]LOG:  replication terminated by primary server
[Standby-2]DETAIL:  End of WAL reached on timeline 1 at 0/3510B14
...
...
...

****** Also, in the ps output, the following is the state of the WAL sender
(on standby-1) and the WAL receiver (on standby-2):

amit      8084  5675  0 14:13 ?        00:00:00 postgres: wal receiver process   restarting at 0/3000000
amit      8085  5648  0 14:13 ?        00:00:00 postgres: wal sender process amit [local] idle


Is this related to the assertion failure that you have reported?

--

Amit Langote



Re: Assertion failure at standby promotion

From:
Heikki Linnakangas
Date:
On 03.05.2013 18:17, Fujii Masao wrote:
> Hi,
>
> I got the following assertion failure when I promoted the standby.
>
> 2013-05-04 00:12:31 JST sby1 LOG:  received promote request
> 2013-05-04 00:12:31 JST sby1 FATAL:  terminating walreceiver process
> due to administrator command
> 2013-05-04 00:12:31 JST sby1 LOG:  redo done at 0/6FFE038
> 2013-05-04 00:12:31 JST sby1 LOG:  last completed transaction was at
> log time 2013-05-04 00:12:25.878909+09
> 2013-05-04 00:12:31 JST sby1 LOG:  selected new timeline ID: 2
> 2013-05-04 00:12:31 JST sby1 LOG:  archive recovery complete
> 2013-05-04 00:12:31 JST sby1 LOG:  checkpoint starting:
> TRAP: FailedAssertion("!(sentPtr <= sendTimeLineValidUpto)", File:
> "walsender.c", Line: 1465)
> 2013-05-04 00:12:31 JST sby1 LOG:  autovacuum launcher started
>
> The way to reproduce this is:
>
> 1. Create one master A, one standby B, and one cascade standby C.
> 2. Run pgbench -i -s 10
> 3. Promote the standby B before pgbench -i finishes

I was able to reproduce this. The assertion checks that if the system is 
promoted at WAL location X, we must not have already sent WAL at > X to 
the client. As the code stands, that assumption is wrong; the walsender 
will merrily stream WAL that hasn't been replayed yet, and the system 
can be promoted before replaying all the WAL that has been streamed to a 
cascading standby. The comment in GetStandbyFlushRecPtr(), which is the 
function that determines how far the WAL may be streamed to a cascading 
standby, says this:

>     /*
>      * We can safely send what's already been replayed. Also, if walreceiver
>      * is streaming WAL from the same timeline, we can send anything that
>      * it has streamed, but hasn't been replayed yet.
>      */
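
For illustration only, that policy boils down to something like the following 
standalone sketch. This is not the actual GetStandbyFlushRecPtr() body; the 
types are plain integers and the LSNs are the ones from the logs earlier in 
the thread:

/*
 * Simplified standalone model of the policy in the comment above; not the
 * server code.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

/* How far a standby lets its walsender stream to a cascading standby. */
static XLogRecPtr
send_limit(XLogRecPtr replayPtr,   /* end of last replayed record */
           XLogRecPtr receivePtr,  /* last position flushed by walreceiver */
           TimeLineID replayTLI,
           TimeLineID receiveTLI)
{
    /*
     * What has been replayed is always safe to send.  If walreceiver is
     * streaming the same timeline, received-but-not-yet-replayed WAL is
     * treated as sendable too, which is what lets sentPtr run ahead of the
     * eventual promotion point.
     */
    if (receiveTLI == replayTLI && receivePtr > replayPtr)
        return receivePtr;
    return replayPtr;
}

int
main(void)
{
    /* 0/3510B14 replayed, 0/3512000 received (cf. the logs above). */
    XLogRecPtr replayed = 0x3510B14;
    XLogRecPtr received = 0x3512000;
    XLogRecPtr limit = send_limit(replayed, received, 1, 1);

    printf("walsender may stream up to %X/%X\n",
           (unsigned) (limit >> 32), (unsigned) limit);
    return 0;
}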

There seem to be two bugs here:

1. This used to work in 9.2, because the startup process would always 
replay all the WAL present in pg_xlog before promoting (the WAL present 
in pg_xlog was streamed from the master). But the refactorings in xlog.c in 
9.3devel broke that, so that the startup process can promote earlier.

2. Even after fixing the logic in xlog.c, there is still a corner case 
where the startup process can promote before all the WAL that has been 
received from walreceiver has been replayed. That happens if the WAL 
streaming is terminated at a page boundary, rather than at a record 
boundary. For example, we might have received WAL up to the page 
boundary at 0/5BFA000, but the last *complete* record that we have 
received ends at 0/5BF9BD8.
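
To make the arithmetic concrete, here is a tiny standalone program using those
numbers. How sendTimeLineValidUpto gets its value here is my reading of the
description above, not code lifted from the server; running it aborts on the
assert, which is the miniature version of the TRAP in the original report:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

int
main(void)
{
    /* Streaming stopped at a page boundary, not at a record boundary. */
    XLogRecPtr sentPtr            = 0x5BFA000;  /* sent up to the page boundary */
    XLogRecPtr lastCompleteRecord = 0x5BF9BD8;  /* end of last complete record */

    /*
     * The startup process can only promote at the end of a complete record,
     * so the old timeline is declared valid only up to that point...
     */
    XLogRecPtr sendTimeLineValidUpto = lastCompleteRecord;

    printf("sentPtr = %lX, sendTimeLineValidUpto = %lX\n",
           (unsigned long) sentPtr, (unsigned long) sendTimeLineValidUpto);

    /* ...and the invariant asserted in walsender.c no longer holds. */
    assert(sentPtr <= sendTimeLineValidUpto);
    return 0;
}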

To fix the second issue, I think two things need to happen. First, we 
need to suppress the check in walsender. Second, we need to teach the 
WAL replay to back off when that happens. At the moment, the replay in 
the cascading standby gets stuck, trying to fetch the next page 
containing rest of the partial WAL record. Instead, it should throw away 
the partial record it has, and resync at the end of the last replayed 
record. I think 9.2 has the same bug, BTW. Without support for timeline 
switches over streaming replication, it was just more difficult to hit.
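
A toy model of that back-off rule might look like the following. This is not
the patch itself; the struct and names are made up for illustration:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

typedef struct
{
    XLogRecPtr read_upto;       /* how far valid streamed bytes go */
    XLogRecPtr last_replayed;   /* end of the last fully replayed record */
    bool       partial_record;  /* a record continues past read_upto */
} replay_state;

/*
 * Called when the upstream reports that its old timeline ends at
 * "switchpoint".  Returns where reading should resume.
 */
static XLogRecPtr
resync_at_timeline_end(replay_state *st, XLogRecPtr switchpoint)
{
    if (st->partial_record && switchpoint <= st->read_upto)
    {
        /*
         * The missing continuation will never arrive on this timeline:
         * throw the partial record away instead of waiting for it, and
         * resume from the end of the last replayed record.
         */
        st->partial_record = false;
        st->read_upto = st->last_replayed;
        return st->last_replayed;
    }
    return st->read_upto;
}

int
main(void)
{
    replay_state st = {
        .read_upto      = 0x5BFA000,   /* streamed to the page boundary */
        .last_replayed  = 0x5BF9BD8,   /* last complete record */
        .partial_record = true,
    };

    printf("resume reading at %lX\n",
           (unsigned long) resync_at_timeline_end(&st, 0x5BF9BD8));
    return 0;
}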

- Heikki



Re: Assertion failure at standby promotion

From:
Heikki Linnakangas
Date:
On 05.05.2013 12:13, Amit Langote wrote:
> Hello,
>
> I tried reproducing the scenario. Note that I did not archive xlogs
> (that is, archive_command = '/bin/true' and a corresponding
> restore_command = '/bin/false'). I performed the steps you mentioned
> and found the following:
> [snip]
>>> [Standby-2]FATAL:  could not receive data from WAL stream: ERROR:  server switched off timeline 1 at 0/3510B14, but walsender already streamed up to 0/3512000

> Is this related to the assertion failure that you have reported?

Yep, same issue. If you compile with assertions disabled, you get that 
error instead of the assertion. They both check for the same thing; it 
was an oversight to have both.

- Heikki



Re: Assertion failure at standby promotion

From:
Heikki Linnakangas
Date:
On 06.05.2013 13:08, Heikki Linnakangas wrote:
> On 03.05.2013 18:17, Fujii Masao wrote:
>> Hi,
>>
>> I got the following assertion failure when I promoted the standby.
>>
>> 2013-05-04 00:12:31 JST sby1 LOG: received promote request
>> 2013-05-04 00:12:31 JST sby1 FATAL: terminating walreceiver process
>> due to administrator command
>> 2013-05-04 00:12:31 JST sby1 LOG: redo done at 0/6FFE038
>> 2013-05-04 00:12:31 JST sby1 LOG: last completed transaction was at
>> log time 2013-05-04 00:12:25.878909+09
>> 2013-05-04 00:12:31 JST sby1 LOG: selected new timeline ID: 2
>> 2013-05-04 00:12:31 JST sby1 LOG: archive recovery complete
>> 2013-05-04 00:12:31 JST sby1 LOG: checkpoint starting:
>> TRAP: FailedAssertion("!(sentPtr <= sendTimeLineValidUpto)", File:
>> "walsender.c", Line: 1465)
>> 2013-05-04 00:12:31 JST sby1 LOG: autovacuum launcher started
>>
>> The way to reproduce this is:
>>
>> 1. Create one master A, one standby B, and one cascade standby C.
>> 2. Run pgbench -i -s 10
>> 3. Promote the standby B before pgbench -i finishes
>
> I was able to reproduce this. The assertion checks that if the system is
> promoted at WAL location X, we must not have already sent WAL at > X to
> the client. As the code stands, that assumption is wrong; the walsender
> will merrily stream WAL that hasn't been replayed yet, and the system
> can be promoted before replaying all the WAL that has been streamed to a
> cascading standby. The comment in GetStandbyFlushRecPtr(), which is the
> function that determines how far the WAL may be streamed to a cascading
> standby, says this:
>
>> /*
>> * We can safely send what's already been replayed. Also, if walreceiver
>> * is streaming WAL from the same timeline, we can send anything that
>> * it has streamed, but hasn't been replayed yet.
>> */
>
> There seem to be two bugs here:
>
> 1. This used to work in 9.2, because the startup process would always
> replay all the WAL present in pg_xlog before promoting (the WAL present
> in pg_xlog was streamed from the master). But the refactorings in xlog.c in
> 9.3devel broke that, so that the startup process can promote earlier.
>
> 2. Even after fixing the logic in xlog.c, there is still a corner case
> where the startup process can promote before all the WAL that has been
> received from walreceiver has been replayed. That happens if the WAL
> streaming is terminated at a page boundary, rather than at a record
> boundary. For example, we might have received WAL up to the page
> boundary at 0/5BFA000, but the last *complete* record that we have
> received ends at 0/5BF9BD8.
>
> To fix the second issue, I think two things need to happen. First, we
> need to suppress the check in walsender. Second, we need to teach the
> WAL replay to back off when that happens. At the moment, the replay in
> the cascading standby gets stuck, trying to fetch the next page
> containing rest of the partial WAL record. Instead, it should throw away
> the partial record it has, and resync at the end of the last replayed
> record.

I came up with the attached patch for this
(fix-standby-promotion-assert-fail-2.patch). You will also need to apply
fast-promotion-quick-fix.patch to work around the bug in fast promotion
I reported here:
http://www.postgresql.org/message-id/5188CFFA.3020209@vmware.com.

pg_receivexlog has a variant of the same bug. If the server has sent WAL
up to a segment boundary, and the segment boundary splits a WAL record in
half, and the server is then promoted so that the promotion checkpoint
(or end-of-recovery) record goes to the previous segment, pg_receivexlog
will not fetch the segment containing the checkpoint record correctly.
When following a timeline switch, it should restart streaming from the
new timeline at the point where the timeline switch occurred, rather
than at the point on the old timeline where it stopped. Usually it's the
same thing, but not if the streaming was paused at a page or segment
boundary.

To fix this for pg_receivexlog, I added the start position of the next
timeline to the result set that the server sends at end of WAL
streaming. Previously, it only sent the TLI of the next timeline, and it
was assumed that the starting point is the same as the point where
streaming was stopped.

It's a bit late to change it, but I have to say I don't like this whole
business of relaying WAL to a cascading standby that hasn't been
replayed. It would be a lot simpler if you could assume that whatever
gets streamed downstream has already been replayed upstream. I
understand that it's nice from a performance point of view, and because
a cascading standby isn't then lagged behind if the replay gets stalled
in the upstream server because of a hot standby conflict. But still..

I'm going to sleep over this and continue testing tomorrow. Please have
a look.

> I think 9.2 has the same bug, BTW. Without support for timeline
> switches over streaming replication, it was just more difficult to hit.

Looking closer, I believe 9.2 is OK. Recovery loops back to retry the
whole record correctly, it was just the timeline switch over streaming
replication code that got confused.

- Heikki

Attachments