Thread: pgsql: Ensure that a standby is able to follow a primary on a newer tim
Ensure that a standby is able to follow a primary on a newer timeline.

Commit 709d003fbd refactored WAL-reading code, but accidentally caused
WalSndSegmentOpen() to fail to follow a timeline switch while reading from
a historic timeline. This issue caused a standby to fail to follow a primary
on a newer timeline when WAL archiving is enabled.

If there is a timeline switch within the segment, WalSndSegmentOpen() should
read from the WAL segment belonging to the new timeline. But previously,
since it failed to follow a timeline switch, it tried to read the WAL segment
with the old timeline. When WAL archiving is enabled, that WAL segment with
the old timeline doesn't exist because it's renamed to .partial. This led the
primary to try to read a non-existent WAL segment, which caused replication
to fail with the error "ERROR: requested WAL segment ... has already been
removed".

This commit fixes WalSndSegmentOpen() so that it's able to follow a timeline
switch, to ensure that a standby is able to follow a primary on a newer
timeline even when WAL archiving is enabled.

This commit also adds a regression test to check whether a standby is able
to follow a primary on a newer timeline when WAL archiving is enabled.

Back-patch to v13 where the bug was introduced.

Reported-by: Kyotaro Horiguchi
Author: Kyotaro Horiguchi, tweaked by Fujii Masao
Reviewed-by: Alvaro Herrera, Fujii Masao
Discussion: https://postgr.es/m/20201209.174314.282492377848029776.horikyota.ntt@gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/fef5b47f6bfc9bfec619bb2e6e66b027e7ff21a3

Modified Files
--------------
src/backend/replication/walsender.c        |  2 +-
src/test/recovery/t/004_timeline_switch.pl | 42 +++++++++++++++++++++++++++---
2 files changed, 40 insertions(+), 4 deletions(-)
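The timeline choice described in the commit message can be sketched as follows. This is a minimal Python model, not the C code in walsender.c: the class `TimelineSwitch`, the function `timeline_for_segment`, and the 16MB segment size are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the default 16MB WAL segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


@dataclass
class TimelineSwitch:
    """Hypothetical description of where a historic timeline ends."""
    valid_upto: int  # byte position (LSN) at which the old timeline ends
    next_tli: int    # timeline that continues after the switch point


def timeline_for_segment(next_segno: int, send_tli: int,
                         switch: Optional[TimelineSwitch] = None) -> int:
    """Pick the timeline whose segment file should be opened.

    send_tli is the timeline currently being streamed. When that timeline
    is historic and the segment about to be opened contains the switch
    point, the file belonging to the next timeline is chosen instead.
    """
    tli = send_tli
    if switch is not None:
        end_segno = switch.valid_upto // WAL_SEGMENT_SIZE
        if next_segno == end_segno:
            tli = switch.next_tli
    return tli


# Hypothetical scenario: timeline 1 ends inside segment 3, timeline 2 follows.
switch = TimelineSwitch(valid_upto=0x3000060, next_tli=2)
before_switch = timeline_for_segment(2, send_tli=1, switch=switch)
at_switch = timeline_for_segment(3, send_tli=1, switch=switch)
not_historic = timeline_for_segment(3, send_tli=1)
```

In this sketch, segment 2 is still read from timeline 1, while segment 3, which contains the switch point, is read from timeline 2. The comparison uses the segment that is about to be opened, which is what lets the reader follow the switch.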
Re: pgsql: Ensure that a standby is able to follow a primary on a newer tim
From: Michael Paquier
Hi Fujii-san,

On Thu, Jan 14, 2021 at 03:32:52AM +0000, Fujii Masao wrote:
> Ensure that a standby is able to follow a primary on a newer timeline.
>
> Commit 709d003fbd refactored WAL-reading code, but accidentally caused
> WalSndSegmentOpen() to fail to follow a timeline switch while reading from
> a historic timeline. This issue caused a standby to fail to follow a primary
> on a newer timeline when WAL archiving is enabled.

florican is telling that this test has some stability problems:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-14%2003%3A55%3A45

Here I can see that replication keeps asking for a segment that's
already gone:
2021-01-13 23:34:52.104 EST [64611:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
2021-01-13 23:34:52.104 EST [64611:2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000003 has already been removed

At quick glance, it seems to be an issue after the promotion of
primary_2 before starting standby_3.
--
Michael
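For readers matching the log lines above, the segment file name can be derived from the timeline and the start LSN. The following is a small sketch of that arithmetic; the helper names `parse_lsn` and `wal_file_name` are hypothetical, and the default 16MB segment size is assumed.

```python
# Assumption: the default 16MB WAL segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' (two hex numbers) to a byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(tli: int, lsn: int, segsize: int = WAL_SEGMENT_SIZE) -> str:
    """Build the 24-hex-digit segment file name for a timeline and LSN."""
    segno = lsn // segsize
    segs_per_xlogid = 0x100000000 // segsize
    return "%08X%08X%08X" % (tli, segno // segs_per_xlogid, segno % segs_per_xlogid)


# Values taken from the log excerpt: streaming starts at 0/3000000 on timeline 1.
requested = wal_file_name(1, parse_lsn("0/3000000"))
# Same position on timeline 2, for comparison.
on_new_timeline = wal_file_name(2, parse_lsn("0/3000000"))
```

Here `requested` evaluates to `000000010000000000000003`, the file named in the error message: the first eight hex digits are the timeline and the rest identify the segment.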
On 2021/01/14 13:59, Michael Paquier wrote:
> Hi Fujii-san,
>
> On Thu, Jan 14, 2021 at 03:32:52AM +0000, Fujii Masao wrote:
>> Ensure that a standby is able to follow a primary on a newer timeline.
>>
>> Commit 709d003fbd refactored WAL-reading code, but accidentally caused
>> WalSndSegmentOpen() to fail to follow a timeline switch while reading from
>> a historic timeline. This issue caused a standby to fail to follow a primary
>> on a newer timeline when WAL archiving is enabled.
>
> florican is telling that this test has some stability problems:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-14%2003%3A55%3A45
>
> Here I can see that replication keeps asking for a segment that's
> already gone:
> 2021-01-13 23:34:52.104 EST [64611:1] LOG: started streaming WAL from
> primary at 0/3000000 on timeline 1
> 2021-01-13 23:34:52.104 EST [64611:2] FATAL: could not receive data
> from WAL stream: ERROR: requested WAL segment
> 000000010000000000000003 has already been removed

Thanks for reporting this! I'm looking at this issue.

My guess is that the requested WAL file was removed unfortunately by
checkpoint because no replication slot is used and wal_keep_size is not set.
So easy fix is to set wal_keep_size to 512MB or other in that test. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
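The guess above can be illustrated with a simplified retention model. This is only a sketch under stated assumptions: it ignores replication slots, archiving status and max_slot_wal_keep_size, assumes 16MB segments, and the function names and segment numbers are hypothetical rather than taken from the server code or the test.

```python
# Assumption: the default 16MB WAL segment size.
WAL_SEGMENT_SIZE_MB = 16


def oldest_segment_to_keep(current_segno: int, redo_segno: int,
                           wal_keep_size_mb: int = 0) -> int:
    """Return the oldest segment number a checkpoint must retain.

    Without slots, everything before the checkpoint's redo segment is a
    candidate for removal. A non-zero wal_keep_size pushes the cutoff back
    so that roughly that much WAL before the current segment survives.
    """
    keep_from = redo_segno
    if wal_keep_size_mb > 0:
        keep_segs = wal_keep_size_mb // WAL_SEGMENT_SIZE_MB
        if current_segno - keep_from < keep_segs:
            # Never go below segment 1.
            keep_from = max(1, current_segno - keep_segs)
    return keep_from


def may_be_removed(segno: int, current_segno: int, redo_segno: int,
                   wal_keep_size_mb: int = 0) -> bool:
    """True if this model allows a checkpoint to remove the segment."""
    return segno < oldest_segment_to_keep(current_segno, redo_segno, wal_keep_size_mb)


# Hypothetical positions: WAL insertion and the latest redo point are in segment 5.
removed_without_setting = may_be_removed(3, current_segno=5, redo_segno=5)
removed_with_512mb = may_be_removed(3, current_segno=5, redo_segno=5,
                                    wal_keep_size_mb=512)
segments_kept_by_512mb = 512 // WAL_SEGMENT_SIZE_MB
```

Under this model, 512MB corresponds to 32 segments of 16MB, so segment 3 would stay available to a standby that connects late, whereas with wal_keep_size unset it may already be gone by the time the standby asks for it.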
Fujii Masao <masao.fujii@oss.nttdata.com> writes:
> On 2021/01/14 13:59, Michael Paquier wrote:
>> florican is telling that this test has some stability problems:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-14%2003%3A55%3A45

> My guess is that the requested WAL file was removed unfortunately by
> checkpoint because no replication slot is used and wal_keep_size is not set.
> So easy fix is to set wal_keep_size to 512MB or other in that test. Thought?

florican did pass this test on the v13 branch, so I agree it's probably
a timing issue not any deeper bug. Your theory seems plausible.

regards, tom lane
On 2021/01/14 14:23, Tom Lane wrote:
> Fujii Masao <masao.fujii@oss.nttdata.com> writes:
>> On 2021/01/14 13:59, Michael Paquier wrote:
>>> florican is telling that this test has some stability problems:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-14%2003%3A55%3A45
>
>> My guess is that the requested WAL file was removed unfortunately by
>> checkpoint because no replication slot is used and wal_keep_size is not set.
>> So easy fix is to set wal_keep_size to 512MB or other in that test. Thought?
>
> florican did pass this test on the v13 branch, so I agree it's probably
> a timing issue not any deeper bug. Your theory seems plausible.

Thanks for the check! So, barring any objection, I will push
the attached patch that sets wal_keep_size in the test.

BTW, I included the URL to Michael's report [1] in the commit log.
But this URL doesn't seem to work fine maybe because <message-id> part
includes a slash character.

[1]
https://postgr.es/m/X//PsenxcC50jDzX@paquier.xyz

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
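The slash problem mentioned above is a general one: inside a URL path, "/" acts as a segment separator, so a message-id containing it is ambiguous unless it is percent-encoded. The sketch below shows the encoding with Python's standard library. Whether the postgr.es redirector accepts the encoded form is an assumption that is not confirmed anywhere in this thread, and the helper name `encode_message_id` is hypothetical.

```python
from urllib.parse import quote


def encode_message_id(message_id: str) -> str:
    """Percent-encode every reserved character, including '/' and '@'."""
    return quote(message_id, safe="")


# Message-id from the URL quoted in the email above.
raw_id = "X//PsenxcC50jDzX@paquier.xyz"
encoded_id = encode_message_id(raw_id)
```

With `safe=""`, the two slashes become `%2F%2F` and the `@` becomes `%40`, so the identifier occupies a single path segment.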