Re: [PATCH] Fix fragile walreceiver test.
| From | Michael Paquier |
|---|---|
| Subject | Re: [PATCH] Fix fragile walreceiver test. |
| Date | |
| Msg-id | aQrzs1VFGz6cF2bN@paquier.xyz |
| In response to | [PATCH] Fix fragile walreceiver test. (Bryan Green <dbryan.green@gmail.com>) |
| Responses | Re: [PATCH] Fix fragile walreceiver test. |
| List | pgsql-hackers |

On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
> Problem: restart() kills the walreceiver (as it should), which writes
> that exact FATAL message to the log. The test then searches the log and
> finds it.

Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl

> The test has a comment claiming "a new log file is used on node
> restart". TAP tests use pg_ctl with a fixed filename that gets reused
> across restarts. No log rotation.

I've fat-fingered this assumption, indeed, missing that one would need
to do an extra rotate_logfile() before the restart.

> The fix is obvious: check that the walreceiver PID stays constant.
> That's what we actually care about anyway.

Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow.

You have a point that these would unlikely happen in normal runs, so a
PID matching check would be relevant most of the time anyway, even if
the original PID has been fetched after the TLI jump has been
processed in standby_2. I'd rather keep the log check, TBH, bypassing
it with an extra rotate_logfile() before the restart of standby_2.

> This matters because changes to I/O behavior elsewhere in the code can
> make this test fail spuriously. I hit it while working on O_CLOEXEC
> handling for Windows.

Fun. And the WAL receiver never stops after the restart of standby_2
with the log entry present in the server logs generated before the
restart, right?

[1]: https://www.postgresql.org/message-id/aQGfoKGgmAbPATp5@paquier.xyz
--
Michael
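
The log-check problem discussed in the message can be illustrated with a small, self-contained sketch. This is not the TAP test itself (which is Perl) and it does not call any PostgreSQL test API: the log lines, the FATAL text, and the helper name `log_has` are all hypothetical stand-ins. It only shows why searching a reused log file from its beginning finds the entry written during the restart, while restricting the search to what was written after the restart (which is roughly what starting a fresh log file achieves) does not.

```python
import re

# Hypothetical stand-in for the FATAL line the walreceiver writes when it is
# killed by a restart; the real message text is not quoted in the thread.
FATAL_PATTERN = r"FATAL:  walreceiver terminated"


def log_has(log_text: str, pattern: str, offset: int = 0) -> bool:
    """Return True if pattern matches log_text at or after offset."""
    return re.search(pattern, log_text[offset:]) is not None


# Simulated contents of one log file reused across a restart (assumed lines).
before_restart = (
    "LOG:  started streaming WAL\n"
    "FATAL:  walreceiver terminated\n"  # written while the node shuts down
)
after_restart = (
    "LOG:  database system is ready\n"
    "LOG:  started streaming WAL\n"
)
full_log = before_restart + after_restart

# Naive check: scans the whole file and trips over the pre-restart entry.
naive_hit = log_has(full_log, FATAL_PATTERN)

# Scoped check: only looks at what was written after the restart.
scoped_hit = log_has(full_log, FATAL_PATTERN, offset=len(before_restart))

print(naive_hit, scoped_hit)
```

In this sketch the scoped check stays negative as long as the walreceiver is not terminated again after the restart, which is the property the log check in the test is meant to verify.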