Обсуждение: basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE
Hi,
While doing some additional testing of (incremental) backups, I ran into
a couple regular failures. After pulling my hair for a couple days, I
realized the issue seems to affect regular backups, and incremental
backups (which I've been trying to test) are likely innocent.
I'm using a simple (and admittedly not very pretty) bash scripts that
takes and verified backups, concurrently with this workload:
1) initialize a cluster
2) initialize pgbench in database 'db'
3) run short pgbench on 'db'
4) maybe do vacuum [full] on 'db'
5) drop a database 'db_copy' if it exists
6) create a database 'db_copy' by copying 'db' using one of the
available strategies (file_copy, wal_log)
7) run short pgbench on 'db_copy'
8) maybe do vacuum [full] on 'db_copy'
And concurrently with this, it takes a basebackup, starts a cluster on
it (on a different port, ofc), and does various checks on that:
a) verify checksums using pg_checksums (cluster has them enabled)
b) run amcheck on tables/indexes on both databases
c) SQL check (we expect all tables to be 'consistent' as if we did a
PITR - in particular sum(balance) is expected to be the same value on
all pgbench tables) on both databases
I believe those are reasonable expectations - that we get a database
with valid checksums, with non-broken tables/indexes, and that the
database looks as a snapshot taken at a single instant.
Unfortunately it doesn't take long for the tests to start failing with
various strange symptoms on the db_copy database (I'm yet to see an
issue on the 'db' database):
i) amcheck fails with 'heap tuple lacks matching index tuple'
ERROR: heap tuple (116195,22) from table "pgbench_accounts" lacks
matching index tuple within index "pgbench_accounts_pkey"
HINT: Retrying verification using the function
bt_index_parent_check() might provide a more specific error.
I've seen this with other tables/indexes too, e.g. system catalogs
pg_statitics or toast tables, but 'accounts' is most common.
ii) amcheck fails with 'could not open file'
ERROR: could not open file "base/18121/18137": No such file or
directory
LINE 9: lateral verify_heapam(relation => c.oid, on_error_stop =>
f...
^
ERROR: could not open file "base/18121/18137": No such file or
directory
iii) failures in the SQL check, with different tables have different
balance sums
SQL check fails (db_copy) (account 156142 branches 136132 tellers
136132 history -42826)
Sometimes this is preceded by amcheck issue, but not always.
I guess this is not the behavior we expect :-(
I've reproduced all of this on PG16 - I haven't tried with older
releases, but I have no reason to assume pre-16 releases are not affected.
With incremental backups I've observed a couple more symptoms, but those
are most likely just fallout of this - not realizing the initial state
is a bit wrong, and making it worse by applying the increments.
The important observation is that this only happens if a database is
created while the backup is running, and that it only happens with the
FILE_COPY strategy - I've never seen this with WAL_LOG (which is the
default since PG15).
I don't recall any reports of similar issues from pre-15 releases, where
FILE_COPY was the only available option - I'm not sure why is that.
Either it didn't have this issue back then, or maybe people happen to
not create databases concurrently with a backup very often. It's a race
condition / timing issue, essentially.
I have no ambition to investigate this part of the code much deeper, or
invent a fix myself, at least not in foreseeable future. But it seems
like something we probably should fix - subtly broken backups are not a
great thing.
I see there have been a couple threads proposing various improvements to
FILE_COPY, that might make it more efficient/faster, namely using the
filesystem cloning [1] or switching pg_upgrade to use it [2]. But having
something that's (maybe) faster but not quite correct does not seem like
a winning strategy to me ...
Alternatively, if we don't have clear desire to fix it, maybe the right
solution would be get rid of it?
regards
[1]
https://www.postgresql.org/message-id/CA+hUKGLM+t+SwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w@mail.gmail.com
[2] https://www.postgresql.org/message-id/Zl9ta3FtgdjizkJ5%40nathan
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения
Re: basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE
От
Nathan Bossart
Дата:
On Mon, Jun 24, 2024 at 04:12:38PM +0200, Tomas Vondra wrote:
> The important observation is that this only happens if a database is
> created while the backup is running, and that it only happens with the
> FILE_COPY strategy - I've never seen this with WAL_LOG (which is the
> default since PG15).
My first thought is that this sounds related to the large comment in
CreateDatabaseUsingFileCopy():
/*
* We force a checkpoint before committing. This effectively means that
* committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be
* replayed (at least not in ordinary crash recovery; we still have to
* make the XLOG entry for the benefit of PITR operations). This avoids
* two nasty scenarios:
*
* #1: When PITR is off, we don't XLOG the contents of newly created
* indexes; therefore the drop-and-recreate-whole-directory behavior of
* DBASE_CREATE replay would lose such indexes.
*
* #2: Since we have to recopy the source database during DBASE_CREATE
* replay, we run the risk of copying changes in it that were committed
* after the original CREATE DATABASE command but before the system crash
* that led to the replay. This is at least unexpected and at worst could
* lead to inconsistencies, eg duplicate table names.
*
* (Both of these were real bugs in releases 8.0 through 8.0.3.)
*
* In PITR replay, the first of these isn't an issue, and the second is
* only a risk if the CREATE DATABASE and subsequent template database
* change both occur while a base backup is being taken. There doesn't
* seem to be much we can do about that except document it as a
* limitation.
*
* See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
* strategy that avoids these problems.
*/
> I don't recall any reports of similar issues from pre-15 releases, where
> FILE_COPY was the only available option - I'm not sure why is that.
> Either it didn't have this issue back then, or maybe people happen to
> not create databases concurrently with a backup very often. It's a race
> condition / timing issue, essentially.
If it requires concurrent activity on the template database, I wouldn't be
surprised at all that this is rare.
> I see there have been a couple threads proposing various improvements to
> FILE_COPY, that might make it more efficient/faster, namely using the
> filesystem cloning [1] or switching pg_upgrade to use it [2]. But having
> something that's (maybe) faster but not quite correct does not seem like
> a winning strategy to me ...
>
> Alternatively, if we don't have clear desire to fix it, maybe the right
> solution would be get rid of it?
It would be unfortunate if we couldn't use this for pg_upgrade, especially
if it is unaffected by these problems.
--
nathan
On 6/24/24 17:14, Nathan Bossart wrote:
> On Mon, Jun 24, 2024 at 04:12:38PM +0200, Tomas Vondra wrote:
>> The important observation is that this only happens if a database is
>> created while the backup is running, and that it only happens with the
>> FILE_COPY strategy - I've never seen this with WAL_LOG (which is the
>> default since PG15).
>
> My first thought is that this sounds related to the large comment in
> CreateDatabaseUsingFileCopy():
>
> /*
> * We force a checkpoint before committing. This effectively means that
> * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be
> * replayed (at least not in ordinary crash recovery; we still have to
> * make the XLOG entry for the benefit of PITR operations). This avoids
> * two nasty scenarios:
> *
> * #1: When PITR is off, we don't XLOG the contents of newly created
> * indexes; therefore the drop-and-recreate-whole-directory behavior of
> * DBASE_CREATE replay would lose such indexes.
> *
> * #2: Since we have to recopy the source database during DBASE_CREATE
> * replay, we run the risk of copying changes in it that were committed
> * after the original CREATE DATABASE command but before the system crash
> * that led to the replay. This is at least unexpected and at worst could
> * lead to inconsistencies, eg duplicate table names.
> *
> * (Both of these were real bugs in releases 8.0 through 8.0.3.)
> *
> * In PITR replay, the first of these isn't an issue, and the second is
> * only a risk if the CREATE DATABASE and subsequent template database
> * change both occur while a base backup is being taken. There doesn't
> * seem to be much we can do about that except document it as a
> * limitation.
> *
> * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
> * strategy that avoids these problems.
> */
>
Perhaps, the mentioned risks certainly seem like it might be related to
the issues I'm observing.
>> I don't recall any reports of similar issues from pre-15 releases, where
>> FILE_COPY was the only available option - I'm not sure why is that.
>> Either it didn't have this issue back then, or maybe people happen to
>> not create databases concurrently with a backup very often. It's a race
>> condition / timing issue, essentially.
>
> If it requires concurrent activity on the template database, I wouldn't be
> surprised at all that this is rare.
>
Right. Although, "concurrent" here means a somewhat different thing.
AFAIK there can't be a any changes concurrent with the CREATE DATABASE
directly, because we make sure there are no connections:
createdb: error: database creation failed: ERROR: source database
"test" is being accessed by other users
DETAIL: There is 1 other session using the database.
But per the comment, it'd be a problem if there is activity after the
database gets copied, but before the backup completes (which is where
the replay will happen).
>> I see there have been a couple threads proposing various improvements to
>> FILE_COPY, that might make it more efficient/faster, namely using the
>> filesystem cloning [1] or switching pg_upgrade to use it [2]. But having
>> something that's (maybe) faster but not quite correct does not seem like
>> a winning strategy to me ...
>>
>> Alternatively, if we don't have clear desire to fix it, maybe the right
>> solution would be get rid of it?
>
> It would be unfortunate if we couldn't use this for pg_upgrade, especially
> if it is unaffected by these problems.
>
Yeah. I wouldn't mind using FILE_COPY in contexts where we know it's
safe, like pg_upgrade. I just don't want to let users to unknowingly
step on this.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company