PG Bug reporting form <noreply@postgresql.org> writes:
> For backups, we use pg_dump to perform a full database dump. Before we start
> a backup, we pause WAL replay on the secondary, resuming it once the dump has
> concluded. We do this because we previously encountered pg_dump failures when
> an AccessExclusiveLock was held on a table that pg_dump was going to dump.
> For some time this setup gave us no problems, but some months ago we started
> seeing sporadic failures when we attempted to restore the dumps of one of our
> databases to verify their integrity. These restores would fail because a key
> was not present in a table:
I really have no idea what's going on there, but can you show the exact
pg_dump command(s) being issued? I'm particularly curious whether you
are using parallel dump. The same for the failing pg_restore.
Also, are all the moving parts (primary server, secondary server,
pg_dump, pg_restore) exactly the same PG version?
> We have managed, with some help from the Postgres IRC channel (special
> thanks to user nickb), to work around the problem. The solution was to begin
> a transaction, export a snapshot to be passed as a pg_dump argument, and
> only then pause WAL replay. From our understanding, pg_dump should already
> implicitly pick a suitable point to start the dump, but that apparently is
> not the case here, hence the bug report.
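> For reference, the workaround looks roughly like this (a sketch on the
> standby; the snapshot ID shown is the illustrative format from the docs,
> and the exact connection details are ours):
>
> ```
> -- Session 1: keep this transaction open for the whole dump, since the
> -- exported snapshot is only valid while the exporting transaction lives.
> BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
> SELECT pg_export_snapshot();
> -- returns a snapshot ID such as '00000003-0000001B-1'
>
> -- Only after the snapshot is exported, pause replay:
> SELECT pg_wal_replay_pause();
>
> -- Then from the shell, dump against that snapshot:
> --   pg_dump --snapshot='00000003-0000001B-1' dbname
> -- and afterwards run SELECT pg_wal_replay_resume(); and end session 1.
> ```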
It's the other way around: the replay mechanism should not damage
any data that's visible to an open snapshot. So I agree this smells
like a bug, but we don't have enough info here to reproduce it.
regards, tom lane