Re: avoid multiple hard links to same WAL file after a crash

From Michael Paquier
Subject Re: avoid multiple hard links to same WAL file after a crash
Msg-id Ym+28lDf5sCJ6pAj@paquier.xyz
In response to Re: avoid multiple hard links to same WAL file after a crash  (Michael Paquier <michael@paquier.xyz>)
Responses Re: avoid multiple hard links to same WAL file after a crash  (Nathan Bossart <nathandbossart@gmail.com>)
List pgsql-hackers

On Sun, May 01, 2022 at 10:08:53PM +0900, Michael Paquier wrote:
> Now, I am surprised by the third code path of durable_rename_excl(),
> as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause
> any issues, as link() should exit with EEXIST when the startup process
> grabs the same history file concurrently.  It seems to me that in this
> last case using durable_rename() could be an improvement and prevent
> extra WAL receiver restarts as a TLI history fetched from the primary
> via streaming or from some archives should be the same, but we could
> be more careful, like the WAL init logic, by skipping the
> durable_rename() and issuing an elog(LOG).  That would not be perfect,
> still a bit better than the current state of HEAD.

Skimming through the buildfarm logs, it turns out that the tests do
hit this race from time to time.  Here is one such example on
rorqual:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&dt=2022-04-20%2004%3A47%3A58&stg=recovery-check

And here are the relevant logs:
2022-04-20 05:04:19.028 UTC [3109048][startup][:0] LOG:  restored log file "00000002.history" from archive
2022-04-20 05:04:19.029 UTC [3109111][walreceiver][:0] LOG:  fetching timeline history file for timeline 2 from primary server
2022-04-20 05:04:19.048 UTC [3109111][walreceiver][:0] FATAL:  could not link file "pg_wal/xlogtemp.3109111" to "pg_wal/00000002.history": File exists
[...]
2022-04-20 05:04:19.234 UTC [3109250][walreceiver][:0] LOG:  started streaming WAL from primary at 0/3000000 on timeline 2

The WAL receiver upgrades the ERROR to a FATAL, and restarts
streaming shortly after.  Using durable_rename() would not be an issue
here.
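
To illustrate the difference outside the server, here is a minimal
standalone sketch, not PostgreSQL code.  It assumes a POSIX filesystem
with hard-link support and a writable /tmp, and it leaves out the fsync
calls that the durable_*() routines perform.  The names write_file(),
install_excl(), install_replace() and simulate_history_race() are
hypothetical and exist only for this example.  It replays the
interleaving from the log above: a first installer puts the history
file in place, a second one then tries link() and gets EEXIST, while
rename() of the same temporary file goes through.

```c
#define _XOPEN_SOURCE 700		/* for mkdtemp() */

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write "content" to a new file at "path"; returns 0 on success, -1 on error. */
static int
write_file(const char *path, const char *content)
{
	size_t		len = strlen(content);
	int			fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, content, len) != (ssize_t) len)
	{
		close(fd);
		return -1;
	}
	return close(fd);
}

/* link()-based install: refuses to replace an existing target (EEXIST). */
static int
install_excl(const char *tmp, const char *target)
{
	if (link(tmp, target) < 0)
		return errno;
	unlink(tmp);
	return 0;
}

/* rename()-based install: atomically replaces an existing target. */
static int
install_replace(const char *tmp, const char *target)
{
	return (rename(tmp, target) < 0) ? errno : 0;
}

/*
 * Two installers race for the same history file name.  Each result is 0 or
 * an errno value.  Returns 0 if the scenario could be set up, -1 otherwise.
 */
static int
simulate_history_race(int *first_rc, int *second_link_rc, int *second_rename_rc)
{
	char		dir[] = "/tmp/tli_race_XXXXXX";
	char		first_tmp[64];
	char		second_tmp[64];
	char		target[64];

	assert(first_rc && second_link_rc && second_rename_rc);
	if (mkdtemp(dir) == NULL)
		return -1;
	snprintf(first_tmp, sizeof(first_tmp), "%s/xlogtemp.first", dir);
	snprintf(second_tmp, sizeof(second_tmp), "%s/xlogtemp.second", dir);
	snprintf(target, sizeof(target), "%s/00000002.history", dir);

	/* Both sides hold the same history content, as in the thread. */
	if (write_file(first_tmp, "same history\n") < 0 ||
		write_file(second_tmp, "same history\n") < 0)
		return -1;

	*first_rc = install_excl(first_tmp, target);			/* wins the race */
	*second_link_rc = install_excl(second_tmp, target);		/* target exists */
	*second_rename_rc = install_replace(second_tmp, target);	/* overwrites */

	/* Best-effort cleanup; some of these paths are already gone. */
	unlink(first_tmp);
	unlink(second_tmp);
	unlink(target);
	rmdir(dir);
	return 0;
}
```

Calling simulate_history_race() and printing strerror() of the three
results should show success, "File exists", and success again, which
matches the FATAL seen in the WAL receiver and the reason a
rename()-based install would not trip over it.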
--
Michael
