Обсуждение: amcheck: add index-all-keys-match verification for B-Tree
Hi hackers, Attached patch adds a new "indexallkeysmatch" option to bt_index_check() and bt_index_parent_check() that verifies each index tuple points to a heap tuple with the same key - the reverse of "heapallindexed". I need the tool to investigate corruption, possibly inflicted by us ourselves. But the tool might be useful for the community too. We hit B-tree corruptions where index entries stored different keys than their heap tuples (e.g. "foobar" in index vs "foo-bar" in heap). This happened with UTF-8 Russian locales around hyphens/spaces. The index structure stayed valid so existing checks didn't catch it. The implementation uses a Bloom filter to avoid excessive random heap I/O. A sequential heap scan fingerprints visible (key, tid) pairs first. During the index traversal, each leaf tuple is probed against the filter; only when the filter says "missing" do we fetch the heap tuple and compare keys. Posting list entries are expanded and checked individually. When both heapallindexed and indexallkeysmatch are enabled, the heap is scanned twice. Combining them into one pass would complicate the code and possibly introduce some errors. There's also a TAP test that detects corruption via expression function swap. Someone might consider not using bug (corrupting indexes by changing expression) in tests, but it's already used, so I reused this bug too. WDYT? Would you like to see it on CF, or do we have enough amcheck patches there already and it's better to postpone it to v20? Best regards, Andrey Borodin.
Вложения
Hello
+ /* Only verify tuples pointing to visible heap rows */
+ if (!heap_entry_is_visible(state, tid))
+ return;
...
+ slot = table_slot_create(state->heaprel, NULL);
+ found = table_tuple_fetch_row_version(state->heaprel, tid,
+ state->snapshot, slot);
+ if (!found)
+ {
This seems like a duplication, heap_entry_is_visible does the same
thing and returns found.
This also means that the if (!found) block should be unreachable?
Wouldn't it be simpler to remove the is_visible check completely and
use an if(found) later?
+ indexinfo = state->indexinfo;
+ estate = CreateExecutorState();
+ GetPerTupleExprContext(estate)->ecxt_scantuple = slot;
+ FormIndexDatum(indexinfo, slot, estate, values, isnull);
+ FreeExecutorState(estate);
This doesn't need the same cleanup code as heapam.c:1754 and :1997?
Seems like state remains, so we have dangling pointers there and could
crash later.
/* These may have been pointing to the now-gone estate */
indexInfo->ii_ExpressionsState = NIL;
indexInfo->ii_PredicateState = NULL;
Hi Zsolt!
Many thanks for the review!
> On 26 Feb 2026, at 04:18, Zsolt Parragi <zsolt.parragi@percona.com> wrote:
>
> Hello
>
> + /* Only verify tuples pointing to visible heap rows */
> + if (!heap_entry_is_visible(state, tid))
> + return;
> ...
> + slot = table_slot_create(state->heaprel, NULL);
> + found = table_tuple_fetch_row_version(state->heaprel, tid,
> + state->snapshot, slot);
> + if (!found)
> + {
>
> This seems like a duplication, heap_entry_is_visible does the same
> thing and returns found.
>
> This also means that the if (!found) block should be unreachable?
>
> Wouldn't it be simpler to remove the is_visible check completely and
> use an if(found) later?
Yup, done so.
Also thinking about it more, I decided to add more corruption checks here:
non-existent tuples are also kind of corruption that must be reported.
This issue addressed in patch 2 so you can verify your notes were addressed.
I have a gut feeling that we might do without snapshot at all...
>
> + indexinfo = state->indexinfo;
> + estate = CreateExecutorState();
> + GetPerTupleExprContext(estate)->ecxt_scantuple = slot;
> + FormIndexDatum(indexinfo, slot, estate, values, isnull);
> + FreeExecutorState(estate);
>
> This doesn't need the same cleanup code as heapam.c:1754 and :1997?
> Seems like state remains, so we have dangling pointers there and could
> crash later.
>
> /* These may have been pointing to the now-gone estate */
> indexInfo->ii_ExpressionsState = NIL;
> indexInfo->ii_PredicateState = NULL;
Yup, fixed and added tests exercising ii_ExpressionsState.
Thanks!
Best regards, Andrey Borodin.
Вложения
> On 26 Feb 2026, at 10:28, Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > I have a gut feeling that we might do without snapshot at all... I decided that it's a good idea to verify only visible tuples. Mismatch of dead tuples might be bad sign, but it's not a corruption. So we need a snapshot. PFA patch with documentation. Thanks! Best regards, Andrey Borodin.
Вложения
I don't see any other logic problems in the code, I only have a few
minor comments/questions:
+ * Bloom filter says (key, tid) not in heap. Follow TID to verify; this
+ * amortizes random heap lookups when the filter has false negatives, or
This comment could be a bit confusing, as bloom filters typically have
false positives, not negatives.
Maybe it would be better to phrase this somehow differently?
Another thing is that now amcheck can create two bloom filters,
allocated at the same time, both up to maintenance_work_mem.
Isn't that a detail that should be at least mentioned somewhere?
And in the documentation, shouldn't this new check mention something
similar to heapallindexed, which describes the check possibly missing
corruption because of the bloom filter?
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index tuple in index \"%s\" does not match heap tuple",
+ RelationGetRelationName(state->rel)),
Shouldn't this also print out the table name?
> On 3 Mar 2026, at 04:19, Zsolt Parragi <zsolt.parragi@percona.com> wrote:
>
> I don't see any other logic problems in the code, I only have a few
> minor comments/questions:
>
> + * Bloom filter says (key, tid) not in heap. Follow TID to verify; this
> + * amortizes random heap lookups when the filter has false negatives, or
>
> This comment could be a bit confusing, as bloom filters typically have
> false positives, not negatives.
> Maybe it would be better to phrase this somehow differently?
>
> Another thing is that now amcheck can create two bloom filters,
> allocated at the same time, both up to maintenance_work_mem.
> Isn't that a detail that should be at least mentioned somewhere?
>
> And in the documentation, shouldn't this new check mention something
> similar to heapallindexed, which describes the check possibly missing
> corruption because of the bloom filter?
>
> + (errcode(ERRCODE_INDEX_CORRUPTED),
> + errmsg("index tuple in index \"%s\" does not match heap tuple",
> + RelationGetRelationName(state->rel)),
>
> Shouldn't this also print out the table name?
Thanks for the review! I've addressed all these issues in patch steps 4 and 5. Other steps are intact. PFA.
Best regards, Andrey Borodin.
Вложения
- v4-0001-amcheck-add-indexallkeysmatch-verification-for-B-.patch
- v4-0005-amcheck-docs-note-two-Bloom-filters-memory-usage-.patch
- v4-0004-amcheck-address-review-fix-Bloom-filter-comment-a.patch
- v4-0003-Document-indexallkeysmatch-parameter-for-amcheck-.patch
- v4-0002-amcheck-report-corruption-when-index-points-to-no.patch
Hi Andrey
I've been reviewing the patch and ran some concurrent tests against it.
I found an issue with false positive corruption reports under concurrent
VACUUM.
+ slot = table_slot_create(state->heaprel, NULL);
+ found = table_tuple_fetch_row_version(state->heaprel, tid,
+ SnapshotAny, slot);
+ if (!found)
+ {
+ ExecDropSingleTupleTableSlot(slot);
+ ereport(ERROR,
bt_verify_index_tuple_points_to_heap uses SnapshotAny with
table_tuple_fetch_row_version to distinguish "tuple doesn't exist" from
"tuple exists but is dead". However, bt_index_check only holds
AccessShareLock which compatible with VACUUM's ShareUpdateExclusiveLock.
A concurrent VACUUM Phase 1 can prune heap pages(ItemIdSetDead) between
Bloom filter probe and the heap lookup. Causing
table_tuple_fetch_row_version to return false for a legal
dead-but-now-pruned tuple and a false positive corruption report.
The table_tuple_satisfies_snapshot check does not help, it only runs
when the fetch succeeds(LP_NORMAL). Once VACUUM sets LP_DEAD,
heapam_fetch sees !ItemIdIsNormal(lp) and returns false before any
snapshot checks.
The reproduce should be: DELETE all rows form a big table, then launch
VACUUM and bt_index_check concurrently. A POC test script attached.