Обсуждение: Bug in amcheck?
Hi hackers.
We see the following error reported by amcheck (I have added dump of
opaque) when it interleaves with autovacuum and cancel pt:
ERROR: mismatch between parent key and child high key in index
"pg_attribute_relid_attnam_index"
DETAIL: Target block=274, target opaque->flags=0, child block=427,
child opaque=11, target page lsn=1/484A8FC8.
CONTEXT: SQL statement "SELECT bt_index_parent_check(indexrelid, true,
true) from pg_index"
So child has BTP_HALF_DEAD bit set.
Autovacuum is interrupted in this place in _bt_pagedel:
/*
* Check here, as calling loops will have locks held, preventing
* interrupts from being processed.
*/
CHECK_FOR_INTERRUPTS();
Reproducing it is not so easy.
First of all I added sleep here:
/*
* Check here, as calling loops will have locks held, preventing
* interrupts from being processed.
*/
pg_usleep(10000);
CHECK_FOR_INTERRUPTS();
Then I create two procedures:
create or replace procedure create_tables(tables integer, partitions
integer) as $$
declare
i integer;
j integer;
begin
for i in 1..tables
loop
execute 'DROP TABLE IF EXISTS t_' || i;
execute 'CREATE TABLE t_' || i || '(pk integer) partition by
range (pk)';
for j in 1..partitions
loop
execute 'create table p_'||i||'_'||j||' partition of
t_'||i||' for values from ('||j||') to ('||(j + 1)||')';
end loop;
execute 'insert into t_'||i||' values
(generate_series(1,'||partitions||'))';
end loop;
end;
$$ language plpgsql;
and
create or replace procedure run_amcheck() as $$
begin
loop
if (select count(*) from pg_stat_activity where
backend_type='autovacuum worker') > 0
then
raise notice 'Run amcheck!';
perform bt_index_parent_check(indexrelid, true, true) from
pg_index;
end if;
perform pg_sleep(1);
end loop;
end;
$$ language plpgsql;
Then I run concurrently run_amcheck()
and the following script for pgbench:
call create_tables(2,1000);
select pg_sleep(2);
If the problem is not reproduced, then cancel run_amcheck() and restart
it once again.
Backtrace (pg16) is the following:
* frame #0: 0x00000001017b6aac
amcheck.dylib`bt_child_highkey_check(state=0x000000010c846318,
target_downlinkoffnum=37, loaded_child="\U00000001", target_level=1) at
verify_nbtree.c:2146:23
frame #1: 0x00000001017b7fd8
amcheck.dylib`bt_child_check(state=0x000000010c846318,
targetkey=0x000000013c01c448, downlinkoffnum=37) at verify_nbtree.c:2262:2
frame #2: 0x00000001017b5f4c
amcheck.dylib`bt_target_page_check(state=0x000000010c846318) at
verify_nbtree.c:1623:4
frame #3: 0x00000001017b3908
amcheck.dylib`bt_check_level_from_leftmost(state=0x000000010c846318,
level=(level = 1, leftmost = 3, istruerootlevel = false)) at
verify_nbtree.c:859:3
frame #4: 0x00000001017b24e8
amcheck.dylib`bt_check_every_level(rel=0x0000000140074f18,
heaprel=0x0000000130070148, heapkeyspace=true, readonly=true,
heapallindexed=true, rootdescend=true) at verify_nbtree.c:603:13
frame #5: 0x00000001017b198c
amcheck.dylib`bt_index_check_internal(indrelid=2674, parentcheck=true,
heapallindexed=true, rootdescend=true) at verify_nbtree.c:362:3
frame #6: 0x00000001017b1a78
amcheck.dylib`bt_index_parent_check(fcinfo=0x000000010c83b040) at
verify_nbtree.c:242:2
I wonder if we should add P_ISHALFDEAD(opaque) for child page?
Hello! > I wonder if we should add P_ISHALFDEAD(opaque) for child page? I am not a btree expert, but things I was able to find so far: In commit d114cc538715e14d29d6de8b6ea1a1d5d3e0edb4 next check is added: > bt_child_highkey_check(state, downlinkoffnum, > child, topaque->btpo_level); At the same time there is a comment below: > * We go ahead with our checks if the child page is half-dead. It's safe > * to do so because we do not test the child's high key, so it does not > * matter that the original high key will have been replaced by a dummy > * truncated high key within _bt_mark_page_halfdead(). All other page > * items are left intact on a half-dead page, so there is still something > * to test. So, yes, it looks like we need to skip the child's high key test for half-dead pages. BWT, have you tried to create an injection_point-based reproducer? Best regards, Mikhail.
On 02/11/2025 2:27 PM, Mihail Nikalayeu wrote: > Hello! > >> I wonder if we should add P_ISHALFDEAD(opaque) for child page? > I am not a btree expert, but things I was able to find so far: > > In commit d114cc538715e14d29d6de8b6ea1a1d5d3e0edb4 next check is added: > >> bt_child_highkey_check(state, downlinkoffnum, >> child, topaque->btpo_level); > At the same time there is a comment below: > >> * We go ahead with our checks if the child page is half-dead. It's safe >> * to do so because we do not test the child's high key, so it does not >> * matter that the original high key will have been replaced by a dummy >> * truncated high key within _bt_mark_page_halfdead(). All other page >> * items are left intact on a half-dead page, so there is still something >> * to test. > So, yes, it looks like we need to skip the child's high key test for > half-dead pages. > > BWT, have you tried to create an injection_point-based reproducer? > > Best regards, > Mikhail. Hello Mikhail, Thank you very much for looking at this issue. And I am very sorry for delay with answer. Unfortunately I was not able to reproduce the problem for the latest Postgres: neither with injection points, neither with my original approach with sleeps. Originally I investigated the customer's problem with PG16. And have reproduced it for pg16,. I checked that relevant amcheck code was not changed since pg16, so I thought that the problem takes place for all Postgres versions. But looks like it is not true.
Hello! > Originally I investigated the customer's problem with PG16. And have > reproduced it for pg16,. I checked that relevant amcheck code was not > changed since pg16, so I thought that the problem takes place for all > Postgres versions. But looks like it is not true. I think it is still broken, but with less probability. Have you tried injection points on v16? Such a test case will make things much more clear. Also, I added Alexander to CC (he is author of bt_child_highkey_check) - maybe the issue is easily understandable for him. Best regards, Mikhail.
On 19/11/2025 00:19, Mihail Nikalayeu wrote:
> Hello!
>
>> Originally I investigated the customer's problem with PG16. And have
>> reproduced it for pg16,. I checked that relevant amcheck code was not
>> changed since pg16, so I thought that the problem takes place for all
>> Postgres versions. But looks like it is not true.
>
> I think it is still broken, but with less probability.
> Have you tried injection points on v16? Such a test case will make
> things much more clear.
Konstantin's original repro involved autovacuum and concurrent sessions.
I was confused by that, because bt_index_parent_check() holds a
ShareLock, which prevents it from running concurrently with vacuum. But
this isn't a race condition as such, the issue arises whenever there is
a half-dead page in the index. To end up with a half-dead page, you need
to gracefully cancel/interrupt autovacuum while it's deleting a page.
The repro's way of canceling autovacuum was very complicated. I didn't
fully understand it, but I think the concurrent dropping/creating of
tables would sometimes cause autovauum to be canceled.
Here's a much more straightforward repro. Apply this little patch:
diff --git a/src/backend/access/nbtree/nbtpage.c
b/src/backend/access/nbtree/nbtpage.c
index 30b43a4dd18..c132fc90277 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2353,6 +2353,12 @@ _bt_unlink_halfdead_page(Relation rel, Buffer
leafbuf, BlockNumber scanblkno,
* Check here, as calling loops will have locks held, preventing
* interrupts from being processed.
*/
+ if (random() < INT32_MAX / 2)
+ {
+ elog(ERROR, "aborting page deletion");
+ }
+ else
+ elog(NOTICE, "unlinking halfdead page %u %u",
BufferGetBlockNumber(leafbuf), scanblkno);
CHECK_FOR_INTERRUPTS();
/* Unlink the current top parent of the subtree */
Then run this:
postgres=# CREATE TABLE amchecktest (id int4);
CREATE TABLE
postgres=# INSERT INTO amchecktest SELECT g from generate_series(1,
1000000) g;
INSERT 0 1000000
postgres=# CREATE INDEX on amchecktest (id);
CREATE INDEX
postgres=# DELETE FROM amchecktest WHERE id > 100000 AND id < 120000;
DELETE 19999
postgres=# -- this will hit the error added by the patch
VACUUM amchecktest;
ERROR: aborting page deletion
CONTEXT: while vacuuming index "amchecktest_id_idx" of relation
"public.amchecktest"
postgres=# select bt_index_parent_check('amchecktest_id_idx');
ERROR: mismatch between parent key and child high key in index
"amchecktest_id_idx"
DETAIL: Target block=3 child block=276 target page lsn=0/6ED0DB68.
To fix this, I guess we need to teach bt_index_parent_check() about
half-dead pages. Anyone volunteer to write that patch?
- Heikki
On Tue, Nov 25, 2025 at 3:03 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > To fix this, I guess we need to teach bt_index_parent_check() about > half-dead pages. Anyone volunteer to write that patch? It's not like bt_index_parent_check doesn't generally know about them. For example, bt_downlink_missing_check goes to great lengths to distinguish between legitimate "missing" downlinks caused by an interrupted page deletion, and real missing downlinks caused by corruption. The problem we're seeing here seems likely limited to code added by commit d114cc53, which enhanced bt_index_parent_check by adding the new bt_child_highkey_check check. bt_child_highkey_check actually reuses bt_downlink_missing_check (which deals with half-dead pages correctly), but still isn't careful enough about half-dead pages. This is kind of surprising, since it *does* account for incomplete splits, which are similar. In short, I think that we need to track something like BtreeCheckState.previncompletesplit, but for half-dead pages. And then actually use that within bt_child_highkey_check, to avoid spurious false-positive reports of corruption. -- Peter Geoghegan
On 25/11/2025 22:51, Peter Geoghegan wrote:
> On Tue, Nov 25, 2025 at 3:03 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> To fix this, I guess we need to teach bt_index_parent_check() about
>> half-dead pages. Anyone volunteer to write that patch?
>
> It's not like bt_index_parent_check doesn't generally know about them.
> For example, bt_downlink_missing_check goes to great lengths to
> distinguish between legitimate "missing" downlinks caused by an
> interrupted page deletion, and real missing downlinks caused by
> corruption.
>
> The problem we're seeing here seems likely limited to code added by
> commit d114cc53, which enhanced bt_index_parent_check by adding the
> new bt_child_highkey_check check. bt_child_highkey_check actually
> reuses bt_downlink_missing_check (which deals with half-dead pages
> correctly), but still isn't careful enough about half-dead pages. This
> is kind of surprising, since it *does* account for incomplete splits,
> which are similar.
+1. It took me a moment to understand bt_downlink_missing_check(). The
comments there talk about interrupted page deletions and incomplete
splits, but it's actually never called for half-dead pages. The caller
checks !P_IGNORE(opaque), and P_IGNORE means BTP_DELETED | BTP_HALF_DEAD.
I don't think BTP_DELETED pages should be reachable here at all. So
perhaps we should specifically check BTP_DELETED and give an ERROR on
those. And for clarity, perhaps move the check for BTP_HALF_DEAD into
bt_downlink_missing_check(), alongside the incomplete-split check, so
that both cases would be checked at the same place. Just for clarity,
it's not wrong as it is.
> In short, I think that we need to track something like
> BtreeCheckState.previncompletesplit, but for half-dead pages. And then
> actually use that within bt_child_highkey_check, to avoid spurious
> false-positive reports of corruption.
I think it's even simpler than that, and we can just do this:
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2268,7 +2268,7 @@ bt_child_highkey_check(BtreeCheckState *state,
* If we visit page with high key, check that it is equal to the
* target key next to corresponding downlink.
*/
- if (!rightsplit && !P_RIGHTMOST(opaque))
+ if (!rightsplit && !P_RIGHTMOST(opaque) && !P_ISHALFDEAD(opaque))
{
BTPageOpaque topaque;
IndexTuple highkey;
Both BTP_HALF_DEAD and BTP_INCOMPLETE_SPLIT indicate that a downlink is
missing, but they work slightly differently. If a page is marked as
BTP_INCOMPLETE_SPLIT, it means that its right sibling has no downlink,
but if a page is marked as BTP_HALF_DEAD, it means that the page itself
has no downlink.
- Heikki
On 26/11/2025 10:20, Heikki Linnakangas wrote:
> On 25/11/2025 22:51, Peter Geoghegan wrote:
>> In short, I think that we need to track something like
>> BtreeCheckState.previncompletesplit, but for half-dead pages. And then
>> actually use that within bt_child_highkey_check, to avoid spurious
>> false-positive reports of corruption.
>
> I think it's even simpler than that, and we can just do this:
>
> --- a/contrib/amcheck/verify_nbtree.c
> +++ b/contrib/amcheck/verify_nbtree.c
> @@ -2268,7 +2268,7 @@ bt_child_highkey_check(BtreeCheckState *state,
> * If we visit page with high key, check that it is equal to the
> * target key next to corresponding downlink.
> */
> - if (!rightsplit && !P_RIGHTMOST(opaque))
> + if (!rightsplit && !P_RIGHTMOST(opaque) && !P_ISHALFDEAD(opaque))
> {
> BTPageOpaque topaque;
> IndexTuple highkey;
>
>
> Both BTP_HALF_DEAD and BTP_INCOMPLETE_SPLIT indicate that a downlink is
> missing, but they work slightly differently. If a page is marked as
> BTP_INCOMPLETE_SPLIT, it means that its right sibling has no downlink,
> but if a page is marked as BTP_HALF_DEAD, it means that the page itself
> has no downlink.
Ok, here's a proper patch with tests. The patch itself is the above
one-liner. It's in patch 0004.
While testing this, I bumped into another similar amcheck bug: if the
root page split is interrupted, verify_btree() complains:
ERROR: block 3 is not true root in index "nbtree_incomplete_splits_i_idx"
Attached patch 0002 contains a fix and a test for that. The fix for that
is also one-liner.
Summary of the patches:
Patch 0001 adds an injection point test for incomplete splits. We
already had such a test for GIN, which handles incomplete splits the
same way as B-tree. I copy-pasted and adapted the GIN test for B-tree.
This was an easy way to increase our test coverage.
Patch 0002 fixes the incomplete-root-split bug in amcheck. It modifies
the test added in patch 0001 to cover the bug fix.
Patch 0003 adds a test for half-dead pages, similar to what 0001 did for
incomplete splits.
Patch 0004 fixes the bogus half-deaf-page error in amcheck, i.e. the
issue that started this thread. It modifies the test introduced in patch
0003 to add amcheck calls, to cover the bug fix.
- Heikki
Вложения
On Mon, Dec 1, 2025 at 4:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Ok, here's a proper patch with tests. The patch itself is the above > one-liner. It's in patch 0004. > > While testing this, I bumped into another similar amcheck bug: if the > root page split is interrupted, verify_btree() complains: > > ERROR: block 3 is not true root in index "nbtree_incomplete_splits_i_idx" > > Attached patch 0002 contains a fix and a test for that. The fix for that > is also one-liner. Good catch. > Summary of the patches: > > Patch 0001 adds an injection point test for incomplete splits. We > already had such a test for GIN, which handles incomplete splits the > same way as B-tree. I copy-pasted and adapted the GIN test for B-tree. > This was an easy way to increase our test coverage. > > Patch 0002 fixes the incomplete-root-split bug in amcheck. It modifies > the test added in patch 0001 to cover the bug fix. > > Patch 0003 adds a test for half-dead pages, similar to what 0001 did for > incomplete splits. > > Patch 0004 fixes the bogus half-deaf-page error in amcheck, i.e. the > issue that started this thread. It modifies the test introduced in patch > 0003 to add amcheck calls, to cover the bug fix. All seem reasonable. These tests will increase nbtree code coverage quite a bit, which is a nice bonus. -- Peter Geoghegan
On 02/12/2025 19:59, Peter Geoghegan wrote: > On Mon, Dec 1, 2025 at 4:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Summary of the patches: >> >> Patch 0001 adds an injection point test for incomplete splits. We >> already had such a test for GIN, which handles incomplete splits the >> same way as B-tree. I copy-pasted and adapted the GIN test for B-tree. >> This was an easy way to increase our test coverage. >> >> Patch 0002 fixes the incomplete-root-split bug in amcheck. It modifies >> the test added in patch 0001 to cover the bug fix. >> >> Patch 0003 adds a test for half-dead pages, similar to what 0001 did for >> incomplete splits. >> >> Patch 0004 fixes the bogus half-deaf-page error in amcheck, i.e. the >> issue that started this thread. It modifies the test introduced in patch >> 0003 to add amcheck calls, to cover the bug fix. > > All seem reasonable. > > These tests will increase nbtree code coverage quite a bit, which is a > nice bonus. Committed, thanks for the review! - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 02/12/2025 19:59, Peter Geoghegan wrote:
>> These tests will increase nbtree code coverage quite a bit, which is a
>> nice bonus.
> Committed, thanks for the review!
In the past couple of days, scorpion and skink have failed
the nbtree_half_dead_pages test with identical symptoms [1][2]:
@@ -41,8 +41,6 @@
(1 row)
vacuum nbtree_half_dead_pages;
-ERROR: error triggered for injection point nbtree-leave-page-half-dead
-CONTEXT: while vacuuming index "nbtree_half_dead_pages_id_idx" of relation "public.nbtree_half_dead_pages"
SELECT injection_points_detach('nbtree-leave-page-half-dead');
injection_points_detach
-------------------------
@@ -67,7 +65,6 @@
-- Finish the deletion and re-check
vacuum nbtree_half_dead_pages;
-NOTICE: notice triggered for injection point nbtree-finish-half-dead-page-vacuum
select * from nbtree_half_dead_pages where id > 99998 and id < 120002;
id
--------
No idea why it took a month for this to show up. Could some
recent change have affected it?
regards, tom lane
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=scorpion&dt=2026-01-02%2004%3A54%3A38
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-12-31%2003%3A34%3A51