Re: [PATCH] Fix fragile walreceiver test.
| От | Xuneng Zhou |
|---|---|
| Тема | Re: [PATCH] Fix fragile walreceiver test. |
| Дата | |
| Msg-id | CABPTF7WOqRKu6Agb6rFgZZ=X2mtaVbM7G=GMYVBx=F2L2TPQsw@mail.gmail.com обсуждение исходный текст |
| Ответ на | Re: [PATCH] Fix fragile walreceiver test. (Michael Paquier <michael@paquier.xyz>) |
| Ответы |
Re: [PATCH] Fix fragile walreceiver test.
|
| Список | pgsql-hackers |
Hi, On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote: > > Problem: restart() kills the walreceiver (as it should), which writes > > that exact FATAL message to the log. The test then searches the log and > > finds it. > > Timing issue then, the buildfarm has not been complaining on this one > AFAIK, there have been no recoveryCheck failures reported: > https://buildfarm.postgresql.org/cgi-bin/show_failures.pl > > > The test has a comment claiming "a new log file is used on node > > restart". TAP tests use pg_ctl with a fixed filename that gets reused > > across restarts. No log rotation. > > I've fat-fingered this assumption, indeed, missing that one would need > to do an extra rotate_logfile() before the restart. > > > The fix is obvious: check that the walreceiver PID stays constant. > > That's what we actually care about anyway. Thanks for catching and analyzing this! > Hmm. The reason why I didn't use a PID matching check (mentioned at > [1]) is that this is not entirely bullet-proof. On a very slow > machine, one could assume that standby_1 generates some records and > that these are replayed by standby_2 *before* the PID of the WAL > receiver is retrieved. This could lead to false positives in some > cases, and a bunch of buildfarm members are very slow. You have a > point that these would unlikely happen in normal runs, so a PID > matching check would be relevant most of the time anyway, even if the > original PID has been fetched after the TLI jump has been processed in > standby_2. I'd rather keep the log check, TBH, bypassing it with an > extra rotate_logfile() before the restart of standby_2. I’ve also prepared a patch for this method. > > This matters because changes to I/O behavior elsewhere in the code can > > make this test fail spuriously. I hit it while working on O_CLOEXEC > > handling for Windows. > > Fun. And the WAL receiver never stops after the restart of standby_2 > with the log entry present in the server logs generated before the > restart, right? > Best, Xuneng
Вложения
В списке pgsql-hackers по дате отправления: