Thread: [BUG] pg_basebackup produces wrong incremental files after relation truncation in segmented tables
Hello PostgreSQL developers,
I’ve encountered a bug in the incremental backup feature that prevents restoration of backups containing relations larger than 1 GB that were vacuum-truncated.
Problem Description
When taking incremental backups of relations that span multiple segments, if the relation is truncated during VACUUM (after the base backup but before the incremental one), pg_combinebackup fails with:
```
file "%s" has truncation block length %u in excess of segment size %u
```
pg_basebackup itself completes without errors, but the resulting incremental backup cannot be restored.
Root Cause
In segmented relations, a VACUUM that truncates blocks sets a limit_block in the WAL summary. The incremental backup logic then miscalculates truncation_block_length for the segments that precede the one containing the limit block, because it compares a relation-wide limit against the segment-local size.
In src/backend/backup/basebackup_incremental.c:
```
*truncation_block_length = size / BLCKSZ;
if (BlockNumberIsValid(limit_block))
{
	unsigned	relative_limit = limit_block - segno * RELSEG_SIZE;

	if (*truncation_block_length < relative_limit)	/* ← problematic */
		*truncation_block_length = relative_limit;
}
```
For example, if limit_block lies in segment 10, then relative_limit will be roughly 10 * RELSEG_SIZE while processing segment 0. This pushes truncation_block_length far beyond the actual segment size, so pg_combinebackup later sees a truncation block length larger than RELSEG_SIZE and reports the restore error above.
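To make the arithmetic concrete, here is a tiny standalone C sketch (not PostgreSQL source; it just hard-codes the default build values BLCKSZ = 8192 and RELSEG_SIZE = 131072 blocks) that replays the computation above for a full segment 0 while the limit block sits in segment 10:
```
#include <stdio.h>

#define BLCKSZ          8192UL		/* default block size */
#define RELSEG_SIZE     131072UL	/* default segment size: 1 GB of 8 kB blocks */

int
main(void)
{
	unsigned long limit_block = 10 * RELSEG_SIZE + 5;	/* limit lies in segment 10 */
	unsigned long segno = 0;							/* backing up segment 0 */
	unsigned long size = RELSEG_SIZE * BLCKSZ;			/* segment 0 is full (1 GB) */
	unsigned long truncation_block_length;
	unsigned long relative_limit;

	truncation_block_length = size / BLCKSZ;			/* 131072 */
	relative_limit = limit_block - segno * RELSEG_SIZE;	/* 1310725 */
	if (truncation_block_length < relative_limit)
		truncation_block_length = relative_limit;

	/*
	 * Prints 1310725, roughly 10 * RELSEG_SIZE, which is exactly the kind
	 * of value pg_combinebackup rejects as "in excess of segment size".
	 */
	printf("truncation_block_length = %lu (RELSEG_SIZE = %lu)\n",
		   truncation_block_length, RELSEG_SIZE);
	return 0;
}
```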
Reproduction Steps
Create a table larger than 1 GB (multiple segments).
Take a full base backup.
Delete rows that occupy the end of the relation.
Run VACUUM (VERBOSE, TRUNCATE) to ensure blocks are removed.
(optional) Confirm that the WAL summary includes a limit entry for the relation.
Take an incremental backup with pg_basebackup.
Attempt to restore using pg_combinebackup.
Observe the truncation block length error.
Patch
A patch correcting this logic is attached, and I’m happy to provide additional details or revisions if helpful.
Best regards,
Oleg Tkachenko
Attachments
Hello,
I am following up on my earlier report and patch, which have not received any responses so far. I suspect the issue only manifests under specific conditions, so I am sending additional details along with a reliable reproducer.
For context:
Incremental backups cannot be restored when a relation larger than 1 GB (multiple segments) is vacuum-truncated between the base backup and the incremental backup; pg_basebackup itself completes successfully, but the resulting incremental backup is not restorable.
For segmented relations, the WAL summarizer records limit_block in the WAL summary. During incremental backup, the truncation length is computed incorrectly because a relation-wide limit is compared against a segment-local size.
Reproducer
I am attaching a bash script that reliably reproduces the issue on my system. The script:
Creates a table large enough to span multiple segments.
Takes a full base backup.
Deletes rows at the end of the relation.
Runs VACUUM (TRUNCATE) to remove blocks.
Takes an incremental backup.
Runs pg_combinebackup, which fails with the truncation block length error.
The script is fully automated and intended to be run as-is.
The patch itself is in the previous message.
I would appreciate feedback on the approach and am happy to revise it if needed.
Best regards,
Oleg Tkachenko
On Nov 14, 2025, at 20:43, Oleg Tkachenko <oatkachenko@gmail.com> wrote:
> <bug_truncation_block_length.patch>
>
> [....]
>
Attachments
On Mon, Dec 15, 2025 at 4:39 PM Oleg Tkachenko <oatkachenko@gmail.com> wrote:
>
> [....]
>
> A patch correcting this logic is attached, and I’m happy to provide additional details or revisions if helpful.
>
Thanks for the reproducer; I can see the reported issue, but I am not
quite sure the proposed fix is correct, and it might break other cases (I
haven't tried constructing such a case yet). There is a comment
describing that case just before the point where you are planning to make
the changes:
/*
* The truncation block length is the minimum length of the reconstructed
* file. Any block numbers below this threshold that are not present in
* the backup need to be fetched from the prior backup. At or above this
* threshold, blocks should only be included in the result if they are
* present in the backup. (This may require inserting zero blocks if the
* blocks included in the backup are non-consecutive.)
*/
IIUC, we might need the original assignment logic as it is. But we
need to ensure that truncation_block_length is not set to a value that
exceeds RELSEG_SIZE.
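For illustration, that clamp could look roughly like this (a sketch against the excerpt from the original report, untested, and not an actual patch):
```
*truncation_block_length = size / BLCKSZ;
if (BlockNumberIsValid(limit_block))
{
	unsigned	relative_limit = limit_block - segno * RELSEG_SIZE;

	/*
	 * A relation-wide limit can point past the segment currently being
	 * backed up; never let it push the threshold beyond one segment.
	 */
	if (relative_limit > RELSEG_SIZE)
		relative_limit = RELSEG_SIZE;

	if (*truncation_block_length < relative_limit)
		*truncation_block_length = relative_limit;
}
```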
Regards,
Amul
[ sorry for not noticing this thread sooner; thanks to Andres for
pointing me to it ]

On Mon, Dec 15, 2025 at 9:01 AM Amul Sul <sulamul@gmail.com> wrote:
>
> [....]
>
> IIUC, we might need the original assignment logic as it is. But we
> need to ensure that truncation_block_length is not set to a value that
> exceeds RELSEG_SIZE.

I think you're right. By way of example, let's say that the current
length of the file is 200 blocks, but the limit block is 100 blocks
into the current segment. That means that the only blocks that we can
get from any previous backup are blocks 0-99. Blocks 100-199 of the
current segment are either mentioned in the WAL summaries we're using
for this backup, or they're all zeroes. We can't set the
truncation_block_length to a value greater than 100, or we'll go
looking for the contents of any zero-filled blocks in previous
backups, which will either fail or produce the wrong answer. But Oleg
is correct that we also shouldn't set it to a value greater than
RELSEG_SIZE. So my guess is that the correct fix might be something
like the attached (untested, for discussion).

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments
[....]

Also, I’ve attached a patch based on your guidance. The changes are
effectively the same as your suggested approach, but I would be happy to
be listed as a contributor.

On Dec 15, 2025, at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
>
> [....]
>
Attachments
On Mon, Dec 15, 2025 at 1:46 PM Oleg Tkachenko <oatkachenko@gmail.com> wrote:
> Also, I’ve attached a patch based on your guidance. The changes are
> effectively the same as your suggested approach, but I would be happy to
> be listed as a contributor.

You'll certainly be listed as the reporter for this issue when a fix is
committed. If you want to be listed as a co-author of the patch, I think
it is fair to say that it will need to contain some code written by you.
For example, maybe you would like to try writing a TAP test case that
fails without this fix and passes with it.

--
Robert Haas
EDB: http://www.enterprisedb.com
> On Dec 16, 2025, at 00:35, Robert Haas <robertmhaas@gmail.com> wrote:
>
> [....]
>
> So my guess is that the correct fix might be something
> like the attached (untested, for discussion).
>
> <v1-0001-Don-t-set-the-truncation_block_length-rather-than.patch>

The change looks good to me. Only nitpick is:
```
Subject: [PATCH v1] Don't set the truncation_block_length rather than RELSEG_SIZE.
```
I guess you meant to say “larger (or greater) than” instead of “rather than”.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Tue, Dec 16, 2025 at 3:06 AM Chao Li <li.evan.chao@gmail.com> wrote:
> I guess you meant to say “larger (or greater) than” instead of “rather than”.

Yes, thanks.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hello, Robert
I’ve created a small test that reproduces the issue. With the proposed fix applied, the test passes, and the reconstruction behaves as expected.
I’m attaching the test for review. Please let me know if this looks OK or if you would like it changed.
Regards,
Oleg
On Dec 15, 2025, at 21:13, Robert Haas <robertmhaas@gmail.com> wrote:
>
> [....]
>
> For example, maybe you would like to try writing a TAP test case that
> fails without this fix and passes with it.
Attachments
On Wed, Dec 17, 2025 at 3:25 AM Oleg Tkachenko <oatkachenko@gmail.com> wrote:
>
> [....]
>
> I’m attaching the test for review. Please let me know if this looks OK
> or if you would like it changed.

Test looks good to me, but I have three suggestions as follows:

1. To minimize repetition in the insert: use fillfactor 10, which is the
minimum we can set for a table, so that we can minimize tuples per
page. Use a longer string and lower count in repeat(), which I believe
helps the test become a bit faster.

2. I think we could add this test to the existing pg_combinebackup test
file instead of creating a new file with a single test. See the attached
version; it’s a bit smaller than your original patch, but since I haven't
copied all of your comments yet, I’ve marked it as WIP.

3. Kindly combine the code fix and the tests into a single patch.

Regards,
Amul
Attachments
On Thu, Dec 18, 2025 at 1:05 AM Amul Sul <sulamul@gmail.com> wrote:
> Test looks good to me, but I have three suggestions as follows:
>
> 1. To minimize repetition in the insert: use fillfactor 10, which is the
> minimum we can set for a table, so that we can minimize tuples per
> page. Use a longer string and lower count in repeat(), which I believe
> helps the test become a bit faster.

I haven't checked how big a relation the test case creates, but it's
worth keeping in mind that the CI tests run on one platform with the
segment size set to six blocks. I think we should design the test case
with that in mind, i.e., don't worry about catching the bug when the
segment size is 1GB, but make sure the test fails in CI without the bug
fix. Let's not rely on fillfactor -- the cost here is the disk space and
the time to write the blocks, not how many tuples they actually contain.

> 2. I think we could add this test to the existing pg_combinebackup test
> file instead of creating a new file with a single test. See the attached
> version; it’s a bit smaller than your original patch, but since I haven't
> copied all of your comments yet, I’ve marked it as WIP.

-1. This kind of thing tends to make the tests harder to understand.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
Here is a refactored test.
Now, it creates data depending on the relation block size, so it works even if the segment size is not standard. I tested it locally with segment_size_blocks = 6, and it works correctly.
I would be happy to hear your comments or suggestions.
Regards,
Oleg
On Dec 18, 2025, at 15:26, Robert Haas <robertmhaas@gmail.com> wrote:
>
> [....]
>
Attachments
On Thu, Dec 18, 2025 at 12:24 PM Oleg Tkachenko <oatkachenko@gmail.com> wrote:
> Here is a refactored test.
>
> Now, it creates data depending on the relation block size, so it works
> even if the segment size is not standard. I tested it locally with
> segment_size_blocks = 6, and it works correctly.
>
> I would be happy to hear your comments or suggestions.

Hi Oleg,

I have been mostly on vacation since you sent this email, but here I am
back again. I tried running this on CI with and without the actual code
fix, and was pleased to see the CI failed on this test without the code
fix and passed with it. But then I noticed that you hadn't updated
meson.build in src/bin/pg_basebackup for the new test, which means that
the test was only running in configure/make builds and not in meson/ninja
builds. When I fixed that, things didn't look so good. The test then
fails:

pg_combinebackup: reconstructing "/tmp/cirrus-ci-build/build/testrun/pg_basebackup/050_incremental_backup_truncation_block/data/t_050_incremental_backup_truncation_block_node2_data/pgdata/base/5/16384.1" (1 blocks, checksum CRC32C)
pg_combinebackup: reconstruction plan: 0:/tmp/cirrus-ci-build/build/testrun/pg_basebackup/050_incremental_backup_truncation_block/data/t_050_incremental_backup_truncation_block_primary_data/backup/full/base/5/16384.1@0
pg_combinebackup: read 1 blocks from "/tmp/cirrus-ci-build/build/testrun/pg_basebackup/050_incremental_backup_truncation_block/data/t_050_incremental_backup_truncation_block_primary_data/backup/full/base/5/16384.1"
pg_combinebackup: reconstructing "/tmp/cirrus-ci-build/build/testrun/pg_basebackup/050_incremental_backup_truncation_block/data/t_050_incremental_backup_truncation_block_node2_data/pgdata/base/5/16384_vm" (131072 blocks, checksum CRC32C)
pg_combinebackup: reconstruction plan: 0-3:/tmp/cirrus-ci-build/build/testrun/pg_basebackup/050_incremental_backup_truncation_block/data/t_050_incremental_backup_truncation_block_primary_data/backup/full/base/5/16384_vm@24576 4:/tmp/cirrus-ci-build/build/testrun/pg_basebackup/050_incremental_backup_truncation_block/data/t_050_incremental_backup_truncation_block_primary_data/backup/incr/base/5/INCREMENTAL.16384_vm@8192 5-131071:zero
pg_combinebackup: error: could not write file "/tmp/cirrus-ci-build/build/testrun/pg_basebackup/050_incremental_backup_truncation_block/data/t_050_incremental_backup_truncation_block_node2_data/pgdata/base/5/16384_vm": No space left on device

I'm not sure what's going on here exactly, but it seems bad. The output
implies that 16384_vm is a full 1GB in size, which doesn't really make
any sense to me at all, but the same thing also happens when I run the
test locally. The VM fork is normally quite small compared to the data,
and here the data is only one block over 1GB, so I'd expect the VM fork
to be just a few blocks. Are we somehow confusing the length of the VM
fork with the length of the main fork?

A couple of stylistic notes: All of the existing incremental backup
tests are in src/bin/pg_combinebackup/t. I suggest putting this one
there too. Normally, our TAP test names are all lower-case, so do the
same here. Try to format the test file so that things fit within 80
columns, by breaking comments and Perl statements at appropriate points.
Consider running src/tools/pgindent/pgperltidy over the script to check
that the way you've broken the Perl statements won't get reindented.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hello Robert,
I checked the VM fork file and found that its incremental version has a wrong
truncation block length in its header:
```
xxd -l 12 INCREMENTAL.16384_vm
0d1f aed3 0100 0000 0000 0200
                    ^^^^ ^^^^ <--- 131072 blocks (1 GB)
```
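For anyone re-checking those bytes, a minimal standalone decoder of that 12-byte header could look like the sketch below. The field names (magic, block count, truncation block length, each a little-endian uint32) are my reading of the incremental file layout, so treat that as an assumption:
```
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* The twelve bytes from the xxd output above. */
	const uint8_t hdr[12] = {
		0x0d, 0x1f, 0xae, 0xd3,		/* magic (assumed) */
		0x01, 0x00, 0x00, 0x00,		/* block count (assumed) */
		0x00, 0x00, 0x02, 0x00		/* truncation block length (assumed) */
	};
	uint32_t	field[3];

	for (int i = 0; i < 3; i++)
		field[i] = (uint32_t) hdr[4 * i] |
			((uint32_t) hdr[4 * i + 1] << 8) |
			((uint32_t) hdr[4 * i + 2] << 16) |
			((uint32_t) hdr[4 * i + 3] << 24);

	printf("magic                   = 0x%08x\n", (unsigned) field[0]);
	printf("block count             = %u\n", (unsigned) field[1]);	/* 1 */
	printf("truncation block length = %u\n", (unsigned) field[2]);	/* 131072 */
	return 0;
}
```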
This value can only come from the WAL summaries, so I checked them too.
One of the summary files contains:
```
TS 1663, DB 5, REL 16384, FORK main: limit 131073
TS 1663, DB 5, REL 16384, FORK vm: limit 131073
TS 1663, DB 5, REL 16384, FORK vm: block 4
```
Both forks have the same limit, which looks wrong.
So I checked the WAL files to see what really happened with the VM fork.
I did not find any “truncate” records for the VM file.
I only found this record for the main fork
(actually, the fork isn’t mentioned at all):
```
rmgr: Storage len (rec/tot): 46/46, tx: 759, lsn: 0/4600D318,
prev 0/4600B2C8, desc: TRUNCATE base/5/16384 to 131073 blocks flags 7
```
This suggests that the WAL summarizer may be mixing up information between
relation forks.
I also noticed this comment in basebackup_incremental.c:
```
/*
* The free-space map fork is not properly WAL-logged, so we need to
* backup the entire file every time.
*/
if (forknum == FSM_FORKNUM)
return BACK_UP_FILE_FULLY;
```
Maybe we should treat the VM fork the same way and always back it up fully?
Another option is to fix the summarizer so it handles forks correctly.
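If we went with the first option, the change could be as small as extending the existing special case quoted above (just a sketch of the idea, not a proposed patch):
```
/*
 * The free-space map fork is not properly WAL-logged, so we need to
 * backup the entire file every time.  If the summarizer cannot be
 * trusted for the visibility map either, the same escape hatch could
 * cover it as well.
 */
if (forknum == FSM_FORKNUM || forknum == VISIBILITYMAP_FORKNUM)
	return BACK_UP_FILE_FULLY;
```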
Best regards,
Oleg
> On Jan 5, 2026, at 17:05, Robert Haas <robertmhaas@gmail.com> wrote:
>
> [....]