Re: amcheck (B-Tree integrity checking tool)

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: amcheck (B-Tree integrity checking tool)
Дата
Msg-id 20161116210303.a2oaelkk2vavr5vq@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: amcheck (B-Tree integrity checking tool)  (Peter Geoghegan <pg@heroku.com>)
Ответы Re: amcheck (B-Tree integrity checking tool)
Список pgsql-hackers
Hi,


I think the patch could use a pgindent run.

On 2016-09-07 11:44:01 -0700, Peter Geoghegan wrote:
> diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
> new file mode 100644
> index 0000000..ebbd6ac
> --- /dev/null
> +++ b/contrib/amcheck/amcheck--1.0.sql
> @@ -0,0 +1,20 @@
> +/* contrib/amcheck/amcheck--1.0.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
> +
> +--
> +-- bt_index_check()
> +--
> +CREATE FUNCTION bt_index_check(index regclass)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_check'
> +LANGUAGE C STRICT;
> +
> +--
> +-- bt_index_parent_check()
> +--
> +CREATE FUNCTION bt_index_parent_check(index regclass)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_parent_check'
> +LANGUAGE C STRICT;

I'd really want a function that runs all check on a table.


> +#define CHECK_SUPERUSER() { \
> +        if (!superuser()) \
> +            ereport(ERROR, \
> +                    (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> +                     (errmsg("must be superuser to use verification functions")))); }

Why is this a macro?


> +/*
> + * Callers to verification functions should never receive a false positive
> + * indication of corruption.  Therefore, when using amcheck functions for
> + * stress testing, it may be useful to temporally change the CORRUPTION elevel
> + * to PANIC, to immediately halt the server in the event of detecting an
> + * invariant condition violation.  This may preserve more information about the
> + * nature of the underlying problem.  Note that modifying the CORRUPTION
> + * constant to be an elevel < ERROR is not well tested.
> + */
> +#ifdef NOT_USED
> +#define CORRUPTION        PANIC
> +#define CONCERN            WARNING
> +#define POSITION        NOTICE
> +#else
> +#define CORRUPTION        ERROR
> +#define CONCERN            DEBUG1
> +#define POSITION        DEBUG2
> +#endif

Hm, if we want that - and it doesn't seem like a bad idea - I think we
should be make it available without recompiling.


> +/*
> + * A B-Tree cannot possibly have this many levels, since there must be one
> + * block per level, which is bound by the range of BlockNumber:
> + */
> +#define InvalidBtreeLevel    ((uint32) InvalidBlockNumber)

Hm, I think it'd be, for the long term, better if we'd move the btree
check code to amcheck_nbtree.c or such.


> +Datum
> +bt_index_parent_check(PG_FUNCTION_ARGS)
> +{
> +    Oid            indrelid = PG_GETARG_OID(0);
> +    Oid            heapid;
> +    Relation    indrel;
> +    Relation    heaprel;
> +
> +    CHECK_SUPERUSER();
> +
> +    /*
> +     * We must lock table before index to avoid deadlocks.  However, if the
> +     * passed indrelid isn't an index then IndexGetRelation() will fail.
> +     * Rather than emitting a not-very-helpful error message, postpone
> +     * complaining, expecting that the is-it-an-index test below will fail.
> +     */
> +    heapid = IndexGetRelation(indrelid, true);
> +    if (OidIsValid(heapid))
> +        heaprel = heap_open(heapid, ShareLock);
> +    else
> +        heaprel = NULL;
> +
> +    /*
> +     * Open the target index relations separately (like relation_openrv(), but
> +     * with heap relation locked first to prevent deadlocking).  In hot standby
> +     * mode this will raise an error.
> +     */
> +    indrel = index_open(indrelid, ExclusiveLock);

Theoretically we should recheck that the oids still match, theoretically
the locking here allows the index->table mapping to change. It's not a
huge window tho...


> +    /* Check for active uses of the index in the current transaction */
> +    CheckTableNotInUse(indrel, "bt_index_parent_check");

Why do we actually care?



> +static void
> +btree_index_checkable(Relation rel)
> +{
> +    if (rel->rd_rel->relkind != RELKIND_INDEX ||
> +        rel->rd_rel->relam != BTREE_AM_OID)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                 errmsg("only nbtree access method indexes are supported"),
> +                 errdetail("Relation \"%s\" is not a B-Tree index.",
> +                           RelationGetRelationName(rel))));
> +
> +    if (RELATION_IS_OTHER_TEMP(rel))
> +        ereport(ERROR,
> +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                 errmsg("cannot access temporary tables of other sessions"),
> +                 errdetail("Index \"%s\" is associated with temporary relation.",
> +                           RelationGetRelationName(rel))));
> +
> +    if (!rel->rd_index->indisready)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                 errmsg("cannot check index \"%s\"",
> +                        RelationGetRelationName(rel)),
> +                 errdetail("Index is not yet ready for insertions")));

Why can't we check this? Seems valuable to allow checking after a
partial CONCURRENTLY index build?

Why don't you use IndexIsReady et al? Not sure if the patch compiles
against an earlier release (there weren't valid/ready/live dead fields,
but instead it was compressed into fewer)?


> +/*
> + * bt_check_every_level()
> + *
> + * Main entry point for B-Tree SQL-callable functions.  Walks the B-Tree in
> + * logical order, verifying invariants as it goes.
> + *
> + * It is the caller's responsibility to acquire appropriate heavyweight lock on
> + * the index relation, and advise us if extra checks are safe when an
> + * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
> + * kind of physical modification to the index structure, including
> + * modifications that VACUUM may make.
> + */

Hm, what kind of locking does the kill_prior_tuple stuff use?


> +static void
> +bt_check_every_level(Relation rel, bool exclusivelock)
> +{

> +    /*
> +     * Certain deletion patterns can result in "skinny" B-Tree indexes, where
> +     * the fast root and true root differ.
> +     *
> +     * Start from the true root, not the fast root, unlike conventional index
> +     * scans.  This approach is more thorough, and removes the risk of
> +     * following a stale fast root from the meta page.
> +     */
> +    if (metad->btm_fastroot != metad->btm_root)
> +        ereport(CONCERN,
> +                (errcode(ERRCODE_DUPLICATE_OBJECT),
> +                 errmsg("fast root mismatch in index %s",
> +                        RelationGetRelationName(rel)),
> +                 errdetail_internal("Fast block %u (level %u) "
> +                                    "differs from true root "
> +                                    "block %u (level %u).",
> +                                    metad->btm_fastroot, metad->btm_fastlevel,
> +                                    metad->btm_root, metad->btm_level)));

Hm, isn't that a potentially spurious report?


> +static BtreeLevel
> +bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
> +{

> +    do
> +    {
> +        /* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */
> +        CHECK_FOR_INTERRUPTS();
> +
> +        /* Initialize state for this iteration */
> +        state->targetblock = current;
> +        state->target = palloc_btree_page(state, state->targetblock);
> +        state->targetlsn = PageGetLSN(state->target);
> +
> +        opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
> +
> +        if (P_IGNORE(opaque))
> +        {
> +            if (P_RIGHTMOST(opaque))
> +                ereport(CORRUPTION,
> +                        (errcode(ERRCODE_INDEX_CORRUPTED),
> +                         errmsg("block %u fell off the end of index \"%s\"",
> +                                current, RelationGetRelationName(state->rel))));
> +            else
> +                ereport(CONCERN,
> +                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                         errmsg("block %u of index \"%s\" ignored",
> +                                current, RelationGetRelationName(state->rel))));

Hm, why is this worth a WARNING? Isn't it perfectly normal to have a few
of these?


> +            /*
> +             * Before beginning any non-trivial examination of level, establish
> +             * next level down's leftmost block number, which next call here
> +             * will pass as its leftmost (iff this isn't leaf level).

"next level down's leftmost block number" - it'd be good to simplify
that a bit.


> +/*
> + * bt_target_page_check()

Why "target"?

> +static void
> +bt_target_page_check(BtreeCheckState *state)
> +{


> +        /*
> +         *   ********************
> +         *   * Page order check *
> +         *   ********************

Isn't that item order, rather than page order?


> +        }
> +        /*
> +         *   ********************
> +         *   * Last item check  *
> +         *   ********************

Newline before block wouldn't hurt.


> +         * Check last item against next/right page's first data item's when
> +         * last item on page is reached.

I think this need some polishing.



> +        /*
> +         *   ********************
> +         *   * Downlink check   *
> +         *   ********************
> +         *
> +         * Additional check of child items against target page (their parent),
> +         * iff this is internal page and caller holds ExclusiveLock on index

s/is/is an/ /holds/holds an/


> +static ScanKey
> +bt_right_page_check_scankey(BtreeCheckState *state)
> +{

> +     * A deleted page won't actually be recycled by VACUUM early enough for us
> +     * to fail to be able follow its right link (or left link, or downlink),

"for us to fail to be able" sounds odd.

> +    /*
> +     * No ExclusiveLock held case -- why it's safe to proceed.
> +     *
> +     * Problem:
> +     *
> +     * We must avoid false positive reports of corruption when caller treats
> +     * item returned here as an upper bound on target's last item.  In general,
> +     * false positives are disallowed.  Ensuring they don't happen in
> +     * !exclusivelock case is subtle.

s/item/an item/?


This paragraph is too complicated to follow in the noisy environment
where we both currently are ;)

> +/*
> + * bt_downlink_check()
> + *
> + * Checks one of target's downlink against its child page.

downlinks?

> + * Conceptually, the target page continues to be what is checked here.  The
> + * target block is still blamed in the event of finding an invariant violation.
> + * The downlink insertion into the target is probably where any problem raised
> + * here arises, and there is no such thing as a parent link, so doing the
> + * verification this way around is much more practical.
> + */
> +static void
> +bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
> +                  ScanKey targetkey)
> +{

> +     * between the deleted page and its right sibling.  (We cannot delete a
> +     * parent page's rightmost page unless it is the last child page, and we
> +     * intend to delete the parent itself.)

"parent page's rightmost page"?


I like this. Some of the more complex pieces towards the end of the
field need some attention, there's a fair amount of word-smithing
needed, and I do think we want to make the structural changes outlined
above, but besides these, imo fairly simple adaptions, I do think this
is useful and not that far from being committable.


- Andres



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Eisentraut
Дата:
Сообщение: Re: Re: [COMMITTERS] pgsql: Build HTML documentation using XSLT stylesheets by default
Следующее
От: Peter Eisentraut
Дата:
Сообщение: Re: Re: [COMMITTERS] pgsql: Build HTML documentation using XSLT stylesheets by default