Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

Поиск
Список
Период
Сортировка
От Thomas Munro
Тема Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
Дата
Msg-id CAEepm=1KQyN+jd+r9X9L2D1zQhH+yofaoUQzVpTBhKm4NG90Yw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
Список pgsql-hackers
On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> As for this patch itself, is it reasonable to try to assert that the
>>> timeline has in fact changed?
>
>> The protocol doesn't include the timeline in reply messages, so it's
>> not clear how the upstream server would know what timeline the standby
>> thinks it's dealing with in any given reply message.  The sending
>> server has its own idea of the current timeline but it's not in sync
>> with the stream of incoming replies.
>
> Fair enough.  But I'd still like an explanation of why only about
> half of the population is showing a failure here.  Seems like every
> machine should be seeing the LSN as moving backwards in this test.
> So (a) why aren't they all failing, and (b) should we change the
> test to make sure every platform sees that happening?

Every machine sees the LSN moving backwards, but the code path that
had the assertion only reached if it decides to interpolate, which is
timing dependent: there needs to be a future sample in the lag
tracking buffer, which I guess is not the case in those runs.  For
example, "jacana" is one of the animals that reported the assertion
failure for 009_twophase.pl but not the one for
004_timeline_switch.pl, in this run:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2017-04-20%2021%3A00%3A20

This standby2 log fragment corresponds to the LSN zigzag manoeuvre:

2017-04-20 22:04:51.744 GMT [1984] LOG:  fetching timeline history
file for timeline 2 from primary server
2017-04-20 22:04:51.744 GMT [1984] LOG:  started streaming WAL from
primary at 0/3000000 on timeline 1
2017-04-20 22:04:51.744 GMT [1984] LOG:  replication terminated by
primary server
2017-04-20 22:04:51.744 GMT [1984] DETAIL:  End of WAL reached on
timeline 1 at 0/3028D50.
2017-04-20 22:04:51.744 GMT [1708] LOG:  new target timeline is 2
2017-04-20 22:04:51.744 GMT [1984] LOG:  restarted WAL streaming at
0/3000000 on timeline 2

It looks approximately the same on my development machine where I got
the assertion failure until commit 84638808:

2017-04-24 09:04:46.563 NZST [95164] LOG:  fetching timeline history
file for timeline 2 from primary server
2017-04-24 09:04:46.563 NZST [95164] LOG:  started streaming WAL from
primary at 0/3000000 on timeline 1
2017-04-24 09:04:46.564 NZST [95164] LOG:  replication terminated by
primary server
2017-04-24 09:04:46.564 NZST [95164] DETAIL:  End of WAL reached on
timeline 1 at 0/3028470.
2017-04-24 09:04:46.565 NZST [95160] LOG:  new target timeline is 2
2017-04-24 09:04:46.565 NZST [95164] LOG:  restarted WAL streaming at
0/3000000 on timeline 2

-- 
Thomas Munro
http://www.enterprisedb.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Dunstan
Дата:
Сообщение: [HACKERS] TAP tests - installcheck vs check
Следующее
От: Andres Freund
Дата:
Сообщение: Re: [HACKERS] valgrind error in subscription code