Re: [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly
Дата
Msg-id CA+Tgmobh2tQpAL10D6F04kCMoU1GHmJfPtRO4yLPV6tDnCVtcg@mail.gmail.com
обсуждение исходный текст
Ответ на [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly  ("Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com>)
Ответы Re: [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly
Список pgsql-hackers
On Thu, Aug 25, 2016 at 10:33 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> The processing went as follows.
>
> 1. node1's timeline is 140.  It wrote a WAL record at the end of WAL segment 117.  The WAL record didn't fit the last
page,so it was split across segments 117 and 118. 
>
> 2. WAL segment 117 was archived.
>
> 3. node1 got down, and node2 was promoted.
>
> 4. As part of the recovery process, node2 retrieves WAL segment 117 from archive.  It found a WAL record fragment at
theend of the segment but could not find the remaining fragment in segment 118, so node2 stops recovery there. 
>
> LOG:  restored log file "0000008C0000028C00000075" from archive
> LOG:  received promote request
> LOG:  redo done at 28C/75FFF738
>
> 5. node2 becomes the primary, and its timeline becomes 118.  node3 is disconnected by node2 (but later reconnectes to
node2).
>
> LOG:  terminating all walsender processes to force cascaded standby(s) to update timeline and reconnect
>
> 6. node3 retrieves and applies WAL segment 117 from archive.
>
> LOG:  restored log file "0000008C0000028C00000075" from archive
>
> 7. node3 found .history file for time line 141 and renews its timeline to 141.
>
> 8. Because node3 found a WAL record fragment at the end of segment 117, it expects to find the remaining fragment at
thebeginning of WAL segment 118 streamed from node2.  But there was a fragment of a different WAL record, because node2
overwrotea different WAL record at the end of segment 117 across to 118. 
>
> LOG:  invalid contrecord length 5892 in log file 652, segment 118, offset 0
>
> 9. node3 then retrieves segment 117 from archive again to get the WAL record at the end of segment 117.  However, as
node3'stimeline is already 141, it complains about the older timeline when it sees the timeline 140 at the beginning of
segment117. 
>
> LOG:  out-of-sequence timeline ID 140 (after 141) in log file 652, segment 117, offset 0

OK.  I agree that's a problem.  However, your patch adds zero new
comment text while removing some existing comments, so I can't easily
tell how it solves that problem or whether it does so correctly.  Even
if I were smart enough to figure it out, I wouldn't want to rely on
the next person also being that smart.  This is obviously a subtle
problem in tricky code, so a clear explanation of the fix seems like a
very good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: Remove "Source Code" column from \df+ ?
Следующее
От: Robert Haas
Дата:
Сообщение: Re: pg_dump, pg_dumpall and data durability