Discussion: pgsql: Ensure that a standby is able to follow a primary on a newer tim

pgsql: Ensure that a standby is able to follow a primary on a newer tim

From
Fujii Masao
Date:
Ensure that a standby is able to follow a primary on a newer timeline.

Commit 709d003fbd refactored WAL-reading code, but accidentally caused
WalSndSegmentOpen() to fail to follow a timeline switch while reading from
a historic timeline. This issue caused a standby to fail to follow a primary
on a newer timeline when WAL archiving is enabled.

If there is a timeline switch within the segment, WalSndSegmentOpen() should
read from the WAL segment belonging to the new timeline. But previously,
because it failed to follow the timeline switch, it tried to read the WAL
segment with the old timeline. When WAL archiving is enabled, that WAL segment
with the old timeline doesn't exist because it has been renamed to .partial.
As a result, the primary tried to read a non-existent WAL segment, which
caused replication to fail with the error "ERROR:  requested WAL segment ...
has already been removed".

This commit fixes WalSndSegmentOpen() so that it's able to follow a timeline
switch, to ensure that a standby is able to follow a primary on a newer
timeline even when WAL archiving is enabled.
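
To make the intended behavior concrete, here is a minimal Python sketch (not the C code in walsender.c) of how a sender might pick the timeline of the segment file it is about to open. The helper names, the 16MB segment size, and the example switch point are assumptions for illustration; only the file-name layout (8 hex digits of timeline ID followed by 16 hex digits locating the segment) follows PostgreSQL's WAL naming convention.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size (16MB)


def wal_file_name(tli: int, segno: int, seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Build a 24-character WAL file name from a timeline ID and segment number."""
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (tli, segno // segs_per_xlogid, segno % segs_per_xlogid)


def timeline_for_segment(next_segno, cur_tli, next_tli, switch_lsn, historic,
                         seg_size=WAL_SEGMENT_SIZE):
    """Hypothetical helper: choose the timeline whose file holds next_segno.

    When streaming from a historic timeline, the segment that contains the
    switch point must be read from the new timeline's file.
    """
    if historic and next_segno == switch_lsn // seg_size:
        return next_tli
    return cur_tli


# Illustrative values: timeline 1 switches to timeline 2 inside segment 3.
switch_lsn = 0x3000060
tli = timeline_for_segment(3, 1, 2, switch_lsn, historic=True)
print(wal_file_name(tli, 3))  # 000000020000000000000003
```

With archiving enabled, the timeline-1 file for that segment exists only as a .partial file, so opening the name built with the old timeline is what produces the "has already been removed" error described above.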

This commit also adds a regression test to check whether a standby is
able to follow a primary on a newer timeline when WAL archiving is enabled.

Back-patch to v13 where the bug was introduced.

Reported-by: Kyotaro Horiguchi
Author: Kyotaro Horiguchi, tweaked by Fujii Masao
Reviewed-by: Alvaro Herrera, Fujii Masao
Discussion: https://postgr.es/m/20201209.174314.282492377848029776.horikyota.ntt@gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/fef5b47f6bfc9bfec619bb2e6e66b027e7ff21a3

Modified Files
--------------
src/backend/replication/walsender.c        |  2 +-
src/test/recovery/t/004_timeline_switch.pl | 42 +++++++++++++++++++++++++++---
2 files changed, 40 insertions(+), 4 deletions(-)


Re: pgsql: Ensure that a standby is able to follow a primary on a newer tim

From
Michael Paquier
Date:
Hi Fujii-san,

On Thu, Jan 14, 2021 at 03:32:52AM +0000, Fujii Masao wrote:
> Ensure that a standby is able to follow a primary on a newer timeline.
>
> Commit 709d003fbd refactored WAL-reading code, but accidentally caused
> WalSndSegmentOpen() to fail to follow a timeline switch while reading from
> a historic timeline. This issue caused a standby to fail to follow a primary
> on a newer timeline when WAL archiving is enabled.

florican is reporting that this test has some stability problems:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-14%2003%3A55%3A45

Here I can see that replication keeps asking for a segment that's
already gone:
2021-01-13 23:34:52.104 EST [64611:1] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2021-01-13 23:34:52.104 EST [64611:2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000003 has already been removed
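
As a reading aid for the log above, the following Python sketch decodes the requested segment name and checks that the streaming start position falls inside it. The 16MB segment size is an assumption (the PostgreSQL default, not stated in the thread), and the helper names are made up for this example.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default; not stated in the thread


def parse_lsn(text: str) -> int:
    """Convert an LSN printed as 'hi/lo' (both hexadecimal) to a byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_wal_file_name(name: str, seg_size: int = WAL_SEGMENT_SIZE):
    """Split a 24-character WAL file name into (timeline ID, segment number)."""
    tli = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return tli, log * (0x100000000 // seg_size) + seg


start_lsn = parse_lsn("0/3000000")
tli, segno = parse_wal_file_name("000000010000000000000003")
print(tli, segno, start_lsn // WAL_SEGMENT_SIZE)  # 1 3 3
```

Under that assumption, the standby asked for the timeline-1 copy of exactly the segment where streaming was supposed to start.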

At a quick glance, it seems to be an issue after the promotion of
primary_2 before starting standby_3.
--
Michael

Attachments

Re: pgsql: Ensure that a standby is able to follow a primary on a newer tim

From
Fujii Masao
Date:

On 2021/01/14 13:59, Michael Paquier wrote:
> Hi Fujii-san,
> 
> On Thu, Jan 14, 2021 at 03:32:52AM +0000, Fujii Masao wrote:
>> Ensure that a standby is able to follow a primary on a newer timeline.
>>
>> Commit 709d003fbd refactored WAL-reading code, but accidentally caused
>> WalSndSegmentOpen() to fail to follow a timeline switch while reading from
>> a historic timeline. This issue caused a standby to fail to follow a primary
>> on a newer timeline when WAL archiving is enabled.
> 
> florican is telling that this test has some stability problems:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-14%2003%3A55%3A45
> 
> Here I can see that replication keeps asking for a segment that's
> already gone:
> 2021-01-13 23:34:52.104 EST [64611:1] LOG:  started streaming WAL from
> primary at 0/3000000 on timeline 1
> 2021-01-13 23:34:52.104 EST [64611:2] FATAL:  could not receive data
> from WAL stream: ERROR:  requested WAL segment
> 000000010000000000000003 has already been removed

Thanks for reporting this! I'm looking at this issue.

My guess is that the requested WAL file was unfortunately removed by a
checkpoint, because no replication slot is used and wal_keep_size is not set.
So an easy fix is to set wal_keep_size to 512MB or so in that test. Thoughts?
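
A rough Python model of why this setting would help is sketched below. It is a simplification that ignores replication slots and the checkpoint redo position, the function name is hypothetical, and the 16MB segment size and the current segment number are assumed values.

```python
def oldest_segment_kept(cur_segno: int, wal_keep_size_mb: int,
                        seg_size_mb: int = 16) -> int:
    """Hypothetical model: oldest segment number a checkpoint would retain."""
    if wal_keep_size_mb <= 0:
        # Nothing older than the current segment is protected in this model.
        return cur_segno
    keep_segs = wal_keep_size_mb // seg_size_mb
    # Never go below segment 1.
    return max(1, cur_segno - keep_segs)


# Assume the primary is currently writing segment 5.
print(oldest_segment_kept(5, 0))    # 5: segment 3 may already be gone
print(oldest_segment_kept(5, 512))  # 1: segment 3 is still retained
```

In this model 512MB corresponds to 32 segments, far more than the test generates, so the segment the standby asks for would survive any checkpoint during the test.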

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: pgsql: Ensure that a standby is able to follow a primary on a newer tim

From
Tom Lane
Date:
Fujii Masao <masao.fujii@oss.nttdata.com> writes:
> On 2021/01/14 13:59, Michael Paquier wrote:
>> florican is telling that this test has some stability problems:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-14%2003%3A55%3A45

> My guess is that the requested WAL file was removed unfortunately by
> checkpoint because no replication slot is used and wal_keep_size is not set.
> So easy fix is to set wal_keep_size to 512MB or other in that test. Thought?

florican did pass this test on the v13 branch, so I agree it's probably
a timing issue not any deeper bug.  Your theory seems plausible.

            regards, tom lane



Re: pgsql: Ensure that a standby is able to follow a primary on a newer tim

From
Fujii Masao
Date:

On 2021/01/14 14:23, Tom Lane wrote:
> Fujii Masao <masao.fujii@oss.nttdata.com> writes:
>> On 2021/01/14 13:59, Michael Paquier wrote:
>>> florican is telling that this test has some stability problems:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-14%2003%3A55%3A45
> 
>> My guess is that the requested WAL file was removed unfortunately by
>> checkpoint because no replication slot is used and wal_keep_size is not set.
>> So easy fix is to set wal_keep_size to 512MB or other in that test. Thought?
> 
> florican did pass this test on the v13 branch, so I agree it's probably
> a timing issue not any deeper bug.  Your theory seems plausible.

Thanks for the check!

So, barring any objection, I will push the attached patch that sets
wal_keep_size in the test.

BTW, I included the URL to Michael's report [1] in the commit log. But this
URL doesn't seem to work, maybe because the <message-id> part includes
slash characters.

[1]
https://postgr.es/m/X//PsenxcC50jDzX@paquier.xyz
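
For illustration only, this is how the slashes in that message-id could be percent-encoded with Python's standard library. Whether postgr.es resolves the encoded form is an assumption that this thread does not confirm.

```python
from urllib.parse import quote

message_id = "X//PsenxcC50jDzX@paquier.xyz"

# Encode "/" as %2F so it is not treated as a path separator; keep "@" as-is.
encoded = quote(message_id, safe="@")
print("https://postgr.es/m/" + encoded)
# https://postgr.es/m/X%2F%2FPsenxcC50jDzX@paquier.xyz
```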

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments