Discussion: "could not open file "pg_wal/…": No such file or directory" — potential crashing bug due to a race condition between restartpoint and recovery

Hi there,

We've found what seems like a race condition on a replica server that is recovering WAL from the archive. We are not 100% sure whether it happens on streaming replicas as well.
Specifically, we occasionally see an error like this:

May 20 09:56:14 redacted[141]: [191806-1] sql_error_code = 58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such file or directory

Upon looking at logs around this, we can see that the recovery process had just restored that file:

May 20 09:56:14 redacted[9]: [2468859-1] sql_error_code = 00000 LOG: restored log file "00000002000072B50000003A" from archive
May 20 09:56:14 ip-10-0-92-26 redacted[141]: [191806-1] sql_error_code = 58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such file or directory

(on this particular server, pid 9 is the recovery process, and pid 141 is the checkpointer)

We've seen this across a range of server versions, including (in the last 30 days):

9.5.12
9.5.14
9.6.10
9.6.6
9.6.7
9.6.8
9.6.9
10.2
10.3
10.4
10.5

Upon digging into the code, we found what we think is the reason behind this race: the recovery process can call unlink() on the file while the restartpoint attempts to recycle the same segment. To quote from the docs:

In archive recovery or standby mode, the server periodically performs restartpoints, which are similar to checkpoints in normal operation: the server forces all its state to disk, updates the pg_control file to indicate that the already-processed WAL data need not be scanned again, and then recycles any old log segment files in the pg_wal directory.

Looking at the code, I think the two racing functions are RestoreArchivedFile and CreateRestartPoint: the former calls unlink() on the WAL segment, while the latter attempts to recycle segments.

From our perspective this looks like a race condition: the recovery process unlinks the file just as the checkpointer tries to recycle the WAL segment. At that point the checkpointer PANICs and shuts down, causing a hit to availability.

As far as we have seen, restarting the server after the crash allows it to continue making progress.

This seems like an extremely rare race - our data show it occurring roughly once every few hundred thousand WAL restores from archive across Heroku Postgres.
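[Editor's note: the suspected window can be modelled outside PostgreSQL with a short sequential sketch. This is toy Python, not the actual C code; the segment name is taken from the log above, and the two steps stand in for what the startup process is believed to do. Between the unlink() of the old segment and the rename() that installs the restored copy, the segment name simply does not exist, so any concurrent open() by another process fails with ENOENT, matching the 58P01 error reported.]

```python
import os
import tempfile

# Toy model of the suspected window (not PostgreSQL code).
pg_wal = tempfile.mkdtemp()
seg = os.path.join(pg_wal, "00000002000072B50000003A")
open(seg, "w").close()            # the old segment, present on disk

os.unlink(seg)                    # step 1: startup removes the old file
try:
    open(seg)                     # a "checkpointer" arriving here loses
except FileNotFoundError as e:
    print("could not open file:", e.strerror)

restored = os.path.join(pg_wal, "RECOVERYXLOG")
open(restored, "w").close()
os.rename(restored, seg)          # step 2: restored copy installed
```

Running this prints `could not open file: No such file or directory`; any process that touches the segment between the two steps sees exactly the error from the logs.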

Thanks

Tom Crayford
Heroku Postgres
On Mon, Sep 24, 2018 at 12:58:59PM +0100, Thomas Crayford wrote:
> May 20 09:56:14 redacted[9]: [2468859-1] sql_error_code = 00000 LOG:
> restored log file "00000002000072B50000003A" from archive
> May 20 09:56:14 ip-10-0-92-26 redacted[141]: [191806-1] sql_error_code =
> 58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such
> file or directory

What kind of restore_command is used here?

> Looking at the code, I think that the two racing functions are
> RestoreArchivedFile, and CreateRestartPoint.
>
> The former calls unlink on the wal segment, CreateRestartPoint does attempt
> to do recycling on segments.

Don't you mean KeepFileRestoredFromArchive()?  RestoreArchivedFile would
call unlink() on pg_wal/RECOVERYXLOG so that does not match.
--
Michael

Hi there,

Ok, thanks for the pointer. The race condition I described still seems plausible, though - does that seem right?

Thanks

Tom

On Mon, Sep 24, 2018 at 4:37 PM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Sep 24, 2018 at 12:58:59PM +0100, Thomas Crayford wrote:
> May 20 09:56:14 redacted[9]: [2468859-1] sql_error_code = 00000 LOG:
> restored log file "00000002000072B50000003A" from archive
> May 20 09:56:14 ip-10-0-92-26 redacted[141]: [191806-1] sql_error_code =
> 58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such
> file or directory

What kind of restore_command is used here?

> Looking at the code, I think that the two racing functions are
> RestoreArchivedFile, and CreateRestartPoint.
>
> The former calls unlink on the wal segment, CreateRestartPoint does attempt
> to do recycling on segments.

Don't you mean KeepFileRestoredFromArchive()?  RestoreArchivedFile would
call unlink() on pg_wal/RECOVERYXLOG so that does not match.
--
Michael
On Fri, Sep 28, 2018 at 01:02:42PM +0100, Thomas Crayford wrote:
> Ok, thanks for the pointer. It seems like the race condition I talked about
> is still accurate, does that seem right?

KeepFileRestoredFromArchive() looks like a good candidate on this matter,
as it removes a WAL segment before replacing it with another of the same
name.  I have a hard time understanding why the checkpointer would try
to recycle a segment just recovered, though, as the startup process would
immediately try to use it.  I have not spent more than an hour looking
at potential spots, though, which is not much for this kind of race
condition.

It is also why I am curious about what kind of restore_command you are
using.
--
Michael

Hi there Michael,

Sorry for the slow response on this - I was on call last week, which was quite busy and distracting.

With respect to the restore_command, we use wal-e: https://github.com/wal-e/wal-e, specifically:

envdir DIRECTORY wal-e wal-fetch "%f" "%p"

Thanks

Tom

On Fri, Sep 28, 2018 at 11:59 PM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Sep 28, 2018 at 01:02:42PM +0100, Thomas Crayford wrote:
> Ok, thanks for the pointer. It seems like the race condition I talked about
> is still accurate, does that seem right?

KeepFileRestoredFromArchive() looks like a good candidate on the matter
as it removes a WAL segment before replacing it by another with the same
name.  I have a hard time understanding why the checkpointer would try
to recycle a segment just recovered though as the startup process would
immediately try to use it.  I have not spent more than one hour looking
at potential spots though, which is not much for this kind of race
conditions.

It is also why I am curious about what kind of restore_command you are
using.
--
Michael
On Mon, Oct 01, 2018 at 12:43:02PM +0100, Thomas Crayford wrote:
> Sorry for the slow response on this - I was oncall last week and it was
> quite distracting and busy.
>
> With respect to the restore_command, we use wal-e:
> https://github.com/wal-e/wal-e, specifically:
>
> envdir DIRECTORY wal-e wal-fetch "%f" "%p"

Okay, so that should be fine.  When it comes to race conditions like
this, putting calls to pg_usleep() directly into the suspicious code
paths would most likely help force the bug to show up.
--
Michael
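[Editor's note: the delay-injection technique Michael suggests can be illustrated with a threaded toy model. Again Python rather than C, with made-up file names; the injected time.sleep() plays the role of a pg_usleep() placed inside the suspected window between unlink() and the rename() that installs the restored segment. Without the delay the "recycler" virtually always finds the segment; with it, the failure becomes deterministic.]

```python
import os
import tempfile
import threading
import time

d = tempfile.mkdtemp()
seg = os.path.join(d, "segment")

def install_restored(delay):
    """Toy stand-in for installing a restored segment: unlink the old
    file, then (after an optionally injected delay) move a fresh copy
    into place."""
    os.unlink(seg)
    time.sleep(delay)                        # injected "pg_usleep"
    tmp = os.path.join(d, "RECOVERYXLOG")
    open(tmp, "w").close()
    os.rename(tmp, seg)

def recycle():
    """Toy stand-in for the checkpointer recycling the segment."""
    try:
        os.rename(seg, os.path.join(d, "recycled"))
        return True
    except FileNotFoundError:                # the race, observed
        return False

results = {}
for delay in (0.0, 0.2):
    open(seg, "w").close()                   # (re)create the segment
    t = threading.Thread(target=install_restored, args=(delay,))
    t.start()
    time.sleep(0.1)                          # recycler arrives "mid-restore"
    results[delay] = recycle()
    t.join()

print(results)   # expect {0.0: True, 0.2: False}: only the injected delay loses the race
```

The point of the technique is exactly this asymmetry: a window that is normally microseconds wide (hence one failure per few hundred thousand restores) is stretched until the other process reliably lands inside it.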
