"could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery

From: Thomas Crayford
Subject: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery
Date:
Msg-id: CAJgZ2Z7SB1ZFpPPUh6Lczw+HdLc9t2rwmxp7OOEa7+-hJqZtEw@mail.gmail.com
Responses: Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery
List: pgsql-bugs
Hi there,

We've found (what seems like) a race condition on a replica server that is restoring WAL from the archive. We are not 100% sure whether it happens on streaming replicas as well.
Specifically, we occasionally see an error like this:

May 20 09:56:14 redacted[141]: [191806-1] sql_error_code = 58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such file or directory

Upon looking at logs around this, we can see that the recovery process had just restored that file:

May 20 09:56:14 redacted[9]: [2468859-1] sql_error_code = 00000 LOG: restored log file "00000002000072B50000003A" from archive
May 20 09:56:14 ip-10-0-92-26 redacted[141]: [191806-1] sql_error_code = 58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such file or directory

(on this particular server, pid 9 is the recovery process, and pid 141 is the checkpointer)

We've seen this across a range of server versions, including (in the last 30 days):

9.5.12
9.5.14
9.6.10
9.6.6
9.6.7
9.6.8
9.6.9
10.2
10.3
10.4
10.5

Upon digging into the code, we found (we think) the reason behind this race. The recovery process can call unlink() on the file, whilst the restartpoint attempts to recycle the segment. To quote from the docs:

In archive recovery or standby mode, the server periodically performs restartpoints, which are similar to checkpoints in normal operation: the server forces all its state to disk, updates the pg_control file to indicate that the already-processed WAL data need not be scanned again, and then recycles any old log segment files in the pg_wal directory.
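To make the timing concrete, here is a small standalone C sketch (this is not PostgreSQL code; the file names and loop counts are invented for illustration). One process keeps unlinking a file, the way the recovery process removes a stale segment, while the other creates it and then tries to open and rename it, the way a restartpoint recycles an old segment. Sooner or later the open fails with ENOENT, which is the same "No such file or directory" we see from the checkpointer:

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEGMENT  "demo_wal_segment"      /* hypothetical names, not real WAL files */
#define RECYCLED "demo_wal_segment.recycled"

int
main(void)
{
    pid_t   child = fork();

    if (child == 0)
    {
        /* "Recovery" side: unlink the segment over and over. */
        for (;;)
            unlink(SEGMENT);
    }

    /* "Checkpointer" side: create the segment, then open and recycle it. */
    for (int i = 0; i < 100000; i++)
    {
        int     fd = open(SEGMENT, O_CREAT | O_WRONLY, 0600);

        if (fd >= 0)
            close(fd);

        /* Recycling involves touching the old file again; if the other
         * process unlinked it in between, we get ENOENT right here. */
        fd = open(SEGMENT, O_RDONLY);
        if (fd < 0 && errno == ENOENT)
        {
            printf("iteration %d: could not open file \"%s\": %s\n",
                   i, SEGMENT, strerror(errno));
            break;
        }
        if (fd >= 0)
            close(fd);

        rename(SEGMENT, RECYCLED);
        unlink(RECYCLED);
    }

    kill(child, SIGTERM);
    waitpid(child, NULL, 0);
    unlink(SEGMENT);
    unlink(RECYCLED);
    return 0;
}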

Looking at the code, I think that the two racing functions are RestoreArchivedFile and CreateRestartPoint.

The former calls unlink() on the WAL segment, while CreateRestartPoint attempts to recycle old segments.

From our perspective this seems like a race condition: the recovery process unlinks the file just as the checkpointer tries to recycle the WAL segment. At that point the checkpointer panics and shuts down, causing a hit to availability.
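For what it's worth, the error text also suggests why a segment that disappears mid-recycle becomes a hard error rather than being skipped: the checkpointer evidently opens the old file as part of recycling (for example to fsync it before renaming, the usual durable-rename pattern). The helper below is our own sketch of that pattern, not PostgreSQL's actual code, just to show where the ENOENT surfaces:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch of an fsync-before-rename recycle step (invented, not PostgreSQL source). */
static int
recycle_segment(const char *oldpath, const char *newpath)
{
    int     fd = open(oldpath, O_RDONLY);

    if (fd < 0)
    {
        /* If another process unlinked oldpath a moment ago, we land here:
         * could not open file "...": No such file or directory */
        fprintf(stderr, "could not open file \"%s\": %s\n",
                oldpath, strerror(errno));
        return -1;
    }
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return rename(oldpath, newpath);
}

int
main(void)
{
    const char *seg = "demo_old_segment";   /* hypothetical file name */
    int         fd = open(seg, O_CREAT | O_WRONLY, 0600);

    if (fd >= 0)
        close(fd);

    unlink(seg);                /* stand-in for the recovery process */

    /* The recycle attempt now fails the same way the checkpointer does. */
    return recycle_segment(seg, "demo_recycled_segment") == 0 ? 0 : 1;
}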

As far as we have seen, starting the process up after the crash allows the server to continue making progress.

This seems like an extremely rare race: our data show it occurring roughly once every few hundred thousand WAL restores from archive that Heroku Postgres performs.

Thanks

Tom Crayford
Heroku Postgres
