Discussion: [PATCH] Fix fragile walreceiver test.
The recovery/004_timeline_switch test has been failing for me on
Windows. The test is wrong.
The test does this:
$node_standby_2->restart;
# ... timeline switch happens ...
ok( !$node_standby_2->log_contains(
        "FATAL: .* terminating walreceiver process due to administrator command"),
    'WAL receiver should not be stopped across timeline jumps');
Problem: restart() kills the walreceiver (as it should), which writes
that exact FATAL message to the log. The test then searches the log and
finds it.
The test has a comment claiming that "a new log file is used on node
restart". That assumption doesn't hold: TAP tests run pg_ctl with a fixed
log filename that is reused across restarts, so there is no log rotation.
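For context, here is roughly what log_contains() does, paraphrased from
memory of PostgreSQL::Test::Cluster (treat the exact code as an
assumption and check the module itself): it slurps the node's single log
file and matches the regexp, optionally from an offset, and the check
quoted above passes no offset.

sub log_contains
{
    my ($self, $pattern, $offset) = @_;

    # With no $offset, the whole file is read -- including the FATAL
    # written by the shutdown half of the earlier restart, since pg_ctl
    # keeps logging to the same file name.
    return PostgreSQL::Test::Utils::slurp_file($self->logfile, $offset)
      =~ m/$pattern/;
}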
I added logging to confirm what's actually happening. The walreceiver
works correctly - same PID handles both timelines:
2025-11-04 23:05:28.539 CST walreceiver[83824] LOG: started streaming WAL from primary at 0/03000000 on timeline 1
2025-11-04 23:05:28.543 CST startup[42764] LOG: new target timeline is 2
2025-11-04 23:05:28.544 CST walreceiver[83824] LOG: restarted WAL streaming at 0/03000000 on timeline 2
That's PID 83824 throughout. Works fine.
Earlier in the same log, from the restart:
2025-11-04 23:05:27.261 CST walreceiver[52440] FATAL: terminating walreceiver process due to administrator command
Different PID (52440), expected shutdown. This is what the test finds.
The fix is obvious: check that the walreceiver PID stays constant.
That's what we actually care about anyway.
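Something along these lines, as a sketch (not necessarily the exact
shape of the attached patch); it assumes the usual safe_psql() helper
and the pg_stat_wal_receiver view on the standby:

# Capture the walreceiver PID on standby_2 before the timeline switch.
my $walreceiver_pid = $node_standby_2->safe_psql('postgres',
    'SELECT pid FROM pg_stat_wal_receiver');

# ... promote standby_1, wait for standby_2 to catch up on timeline 2 ...

# The same process should still be streaming afterwards.
my $walreceiver_pid_after = $node_standby_2->safe_psql('postgres',
    'SELECT pid FROM pg_stat_wal_receiver');
is($walreceiver_pid_after, $walreceiver_pid,
    'WAL receiver not restarted across the timeline jump');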
This matters because changes to I/O behavior elsewhere in the code can
make this test fail spuriously. I hit it while working on O_CLOEXEC
handling for Windows.
Patch attached.
--
Bryan Green
EDB: https://www.enterprisedb.com
Attachments
On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
> Problem: restart() kills the walreceiver (as it should), which writes
> that exact FATAL message to the log. The test then searches the log and
> finds it.

Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl

> The test has a comment claiming "a new log file is used on node
> restart". TAP tests use pg_ctl with a fixed filename that gets reused
> across restarts. No log rotation.

I've fat-fingered this assumption, indeed, missing that one would need
to do an extra rotate_logfile() before the restart.

> The fix is obvious: check that the walreceiver PID stays constant.
> That's what we actually care about anyway.

Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow. You have a
point that these would unlikely happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2. I'd rather keep the log check, TBH, bypassing it with an
extra rotate_logfile() before the restart of standby_2.

> This matters because changes to I/O behavior elsewhere in the code can
> make this test fail spuriously. I hit it while working on O_CLOEXEC
> handling for Windows.

Fun. And the WAL receiver never stops after the restart of standby_2
with the log entry present in the server logs generated before the
restart, right?

[1]: https://www.postgresql.org/message-id/aQGfoKGgmAbPATp5@paquier.xyz
--
Michael
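A minimal sketch of the rotation idea described above, using the
existing rotate_logfile() helper (the exact shape of the committed
change may differ):

# Switch standby_2 to a fresh log file before the restart, so the
# expected "terminating walreceiver ... administrator command" FATAL
# from the shutdown stays in the old file and cannot be matched by the
# later log_contains() check.
$node_standby_2->rotate_logfile;
$node_standby_2->restart;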
Attachments
Hi,

On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
> > Problem: restart() kills the walreceiver (as it should), which writes
> > that exact FATAL message to the log. The test then searches the log and
> > finds it.
>
> Timing issue then, the buildfarm has not been complaining on this one
> AFAIK, there have been no recoveryCheck failures reported:
> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
>
> > The test has a comment claiming "a new log file is used on node
> > restart". TAP tests use pg_ctl with a fixed filename that gets reused
> > across restarts. No log rotation.
>
> I've fat-fingered this assumption, indeed, missing that one would need
> to do an extra rotate_logfile() before the restart.
>
> > The fix is obvious: check that the walreceiver PID stays constant.
> > That's what we actually care about anyway.

Thanks for catching and analyzing this!

> Hmm. The reason why I didn't use a PID matching check (mentioned at
> [1]) is that this is not entirely bullet-proof. On a very slow
> machine, one could assume that standby_1 generates some records and
> that these are replayed by standby_2 *before* the PID of the WAL
> receiver is retrieved. This could lead to false positives in some
> cases, and a bunch of buildfarm members are very slow. You have a
> point that these would unlikely happen in normal runs, so a PID
> matching check would be relevant most of the time anyway, even if the
> original PID has been fetched after the TLI jump has been processed in
> standby_2. I'd rather keep the log check, TBH, bypassing it with an
> extra rotate_logfile() before the restart of standby_2.

I've also prepared a patch for this method.

> > This matters because changes to I/O behavior elsewhere in the code can
> > make this test fail spuriously. I hit it while working on O_CLOEXEC
> > handling for Windows.
>
> Fun. And the WAL receiver never stops after the restart of standby_2
> with the log entry present in the server logs generated before the
> restart, right?
>

Best,
Xuneng
Attachments
On Wed, Nov 05, 2025 at 03:30:30PM +0800, Xuneng Zhou wrote:
> On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
>> Timing issue then, the buildfarm has not been complaining on this one
>> AFAIK, there have been no recoveryCheck failures reported:
>> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl

drongo has just reported one failure, so I stand corrected:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-11-05%2003%3A50%3A50

And one log rotation should be enough before the restart.

>> Hmm. The reason why I didn't use a PID matching check (mentioned at
>> [1]) is that this is not entirely bullet-proof. On a very slow
>> machine, one could assume that standby_1 generates some records and
>> that these are replayed by standby_2 *before* the PID of the WAL
>> receiver is retrieved. This could lead to false positives in some
>> cases, and a bunch of buildfarm members are very slow. You have a
>> point that these would unlikely happen in normal runs, so a PID
>> matching check would be relevant most of the time anyway, even if the
>> original PID has been fetched after the TLI jump has been processed in
>> standby_2. I'd rather keep the log check, TBH, bypassing it with an
>> extra rotate_logfile() before the restart of standby_2.
>
> I've also prepared a patch for this method.

That's exactly what I have done a couple of minutes ago, and noticed
your message before applying the fix, so I've listed you as a
co-author on this one.

I have also kept the PID check after pondering a bit about it. A TLI
jump could be replayed before we grab the initial PID, but in most
cases it should be able to do its work correctly.
--
Michael
Attachments
Hi,

On Wed, Nov 5, 2025 at 3:56 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Nov 05, 2025 at 03:30:30PM +0800, Xuneng Zhou wrote:
> > On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
> >> Timing issue then, the buildfarm has not been complaining on this one
> >> AFAIK, there have been no recoveryCheck failures reported:
> >> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
>
> drongo has just reported one failure, so I stand corrected:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-11-05%2003%3A50%3A50
>
> And one log rotation should be enough before the restart.
>
> >> Hmm. The reason why I didn't use a PID matching check (mentioned at
> >> [1]) is that this is not entirely bullet-proof. On a very slow
> >> machine, one could assume that standby_1 generates some records and
> >> that these are replayed by standby_2 *before* the PID of the WAL
> >> receiver is retrieved. This could lead to false positives in some
> >> cases, and a bunch of buildfarm members are very slow. You have a
> >> point that these would unlikely happen in normal runs, so a PID
> >> matching check would be relevant most of the time anyway, even if the
> >> original PID has been fetched after the TLI jump has been processed in
> >> standby_2. I'd rather keep the log check, TBH, bypassing it with an
> >> extra rotate_logfile() before the restart of standby_2.
> >
> > I've also prepared a patch for this method.
>
> That's exactly what I have done a couple of minutes ago, and noticed
> your message before applying the fix, so I've listed you as a
> co-author on this one.
>

Thanks.

> I have also kept the PID check after pondering a bit about it. A TLI
> jump could be replayed before we grab the initial PID, but in most
> cases it should be able to do its work correctly.

Checking the PID seems straightforward and mostly makes sense to me.

Best,
Xuneng
On 11/5/2025 1:55 AM, Michael Paquier wrote:
> On Wed, Nov 05, 2025 at 03:30:30PM +0800, Xuneng Zhou wrote:
>> On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
>>> Timing issue then, the buildfarm has not been complaining on this one
>>> AFAIK, there have been no recoveryCheck failures reported:
>>> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
>
> drongo has just reported one failure, so I stand corrected:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-11-05%2003%3A50%3A50
>

Perfect timing on that drongo failure - it validates exactly what I was
seeing with the O_CLOEXEC changes.

> And one log rotation should be enough before the restart.
>
>>> Hmm. The reason why I didn't use a PID matching check (mentioned at
>>> [1]) is that this is not entirely bullet-proof. On a very slow
>>> machine, one could assume that standby_1 generates some records and
>>> that these are replayed by standby_2 *before* the PID of the WAL
>>> receiver is retrieved. This could lead to false positives in some
>>> cases, and a bunch of buildfarm members are very slow. You have a
>>> point that these would unlikely happen in normal runs, so a PID
>>> matching check would be relevant most of the time anyway, even if the
>>> original PID has been fetched after the TLI jump has been processed in
>>> standby_2. I'd rather keep the log check, TBH, bypassing it with an
>>> extra rotate_logfile() before the restart of standby_2.
>>
>> I've also prepared a patch for this method.
>
> That's exactly what I have done a couple of minutes ago, and noticed
> your message before applying the fix, so I've listed you as a
> co-author on this one.
>

Thanks for the quick turnaround and for including both Xuneng and me.

> I have also kept the PID check after pondering a bit about it. A TLI
> jump could be replayed before we grab the initial PID, but in most
> cases it should be able to do its work correctly.
> --
> Michael

rotate_logfile() fixes the fundamental issue (log file reuse), while the
PID check adds extra verification for the common case. A belt-and-braces
approach makes sense here.

Your point about slow machines is well taken - I hadn't considered the
window between the restart and the PID capture on heavily loaded systems.

This clears the path for my O_CLOEXEC work without worrying about
spurious test failures. Appreciate it.
On Wed, Nov 05, 2025 at 07:38:23AM -0600, Bryan Green wrote:
> Thanks for the quick turnaround and for including both Xuneng and me.

I owe you one here for noticing the issue quickly. The buildfarm
took some time to reproduce the failure and we have a minor release
coming next week. It would be annoying to release something with a
known instability, at least for me.
--
Michael
Attachments
On Thu, Nov 06, 2025 at 07:58:55AM +0900, Michael Paquier wrote:
> I owe you one here for noticing the issue quickly. The buildfarm
> took some time to reproduce the failure and we have a minor release
> coming next week. It would be annoying to release something with a
> known instability, at least for me.

drongo has gone back to green, so we should be good. See:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=3&member=drongo&filter=Submit
--
Michael